Re: [PR] Spark: Added merge schema as spark configuration [iceberg]

2024-01-02 Thread via GitHub


Aleena-M-Georgy closed pull request #9397: Spark: Added merge schema as spark 
configuration
URL: https://github.com/apache/iceberg/pull/9397





Re: [I] Apache Flink not committing new snapshots to Iceberg Table [iceberg]

2024-01-02 Thread via GitHub


FranMorilloAWS commented on issue #9089:
URL: https://github.com/apache/iceberg/issues/9089#issuecomment-1873872091

   Any thoughts on why this could be happening?





Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-02 Thread via GitHub


vinitpatni commented on PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1874003109

   - Adding Junit 5 conversion and AssertJ style for TestFlinkUpsert and 
TestRewriteDataFilesAction





Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-02 Thread via GitHub


vinitpatni commented on PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1874010083

   @nastra TestStreamScanSql is the only subclass remaining for conversion to 
JUnit 5, but it has a dependency on the GenericAppenderHelper class, which is 
part of the iceberg-data module. Let me know how to tackle that?





Re: [I] Create JUnit5-version of FlinkCatalogTestBase [iceberg]

2024-01-02 Thread via GitHub


vinitpatni commented on issue #9079:
URL: https://github.com/apache/iceberg/issues/9079#issuecomment-1874038839

   Following is the active PR for this one: 
https://github.com/apache/iceberg/pull/9381





Re: [I] Schema IDs Re-Order? [iceberg-python]

2024-01-02 Thread via GitHub


sebpretzer closed issue #229: Schema IDs Re-Order?
URL: https://github.com/apache/iceberg-python/issues/229





Re: [I] Schema IDs Re-Order? [iceberg-python]

2024-01-02 Thread via GitHub


sebpretzer commented on issue #229:
URL: https://github.com/apache/iceberg-python/issues/229#issuecomment-1874187979

   @Fokko To follow up, we have fixed this internally with something similar 
to the following:
   ```python
   from pydantic import BaseModel, field_validator
   from pyiceberg.schema import Schema, assign_fresh_schema_ids
   
   class FooBarModel(BaseModel):
       iceberg: Schema
   
       @field_validator("iceberg")
       def field_ids_in_preorder(cls, v: Schema) -> Schema:
           if v != assign_fresh_schema_ids(v):
               raise ValueError("Schema ids not in preorder")
           return v
   ```
   It places a bit more burden on users to properly order the field ids, but 
the schema will no longer change underneath them. If you want something similar 
upstream in `Schema`, I am happy to take a crack at it.








Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439572133


##
core/src/main/java/org/apache/iceberg/io/ClusteredPositionDeleteWriter.java:
##
@@ -46,17 +49,39 @@ public ClusteredPositionDeleteWriter(
   OutputFileFactory fileFactory,
   FileIO io,
   long targetFileSizeInBytes) {
+    this(writerFactory, fileFactory, io, targetFileSizeInBytes, DeleteGranularity.PARTITION);
+  }
+
+  public ClusteredPositionDeleteWriter(
+      FileWriterFactory<T> writerFactory,
+      OutputFileFactory fileFactory,
+      FileIO io,
+      long targetFileSizeInBytes,
+      DeleteGranularity granularity) {
     this.writerFactory = writerFactory;
     this.fileFactory = fileFactory;
     this.io = io;
     this.targetFileSizeInBytes = targetFileSizeInBytes;
+    this.granularity = granularity;
     this.deleteFiles = Lists.newArrayList();
     this.referencedDataFiles = CharSequenceSet.empty();
   }
 
   @Override
   protected FileWriter<PositionDelete<T>, DeleteWriteResult> newWriter(
       PartitionSpec spec, StructLike partition) {
+    switch (granularity) {
+      case FILE:
+        return new TargetedPositionDeleteWriter<>(() -> newRollingWriter(spec, partition));

Review Comment:
   We can actually roll correctly here because this is the "clustered" path. We 
are not going to use the sorting writer and will not buffer deletes. We can 
also roll correctly in the "fanout" path because the sorting writer acts as a 
wrapper on top of the rolling writer.






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439581834


##
core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileWriter;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.util.CharSequenceSet;
+
+/**
+ * A position delete writer that produces a separate delete file for each 
referenced data file.
+ *
+ * This writer does not keep track of seen deletes and assumes all incoming 
records are ordered
+ * by file and position as required by the spec. If there is no external 
process to order the
+ * records, consider using {@link SortingPositionOnlyDeleteWriter} instead.
+ */
+public class TargetedPositionDeleteWriter<T>
+    implements FileWriter<PositionDelete<T>, DeleteWriteResult> {
+
+  private final Supplier<FileWriter<PositionDelete<T>, DeleteWriteResult>> writers;
+  private final List<DeleteFile> deleteFiles;
+  private final CharSequenceSet referencedDataFiles;
+
+  private FileWriter<PositionDelete<T>, DeleteWriteResult> currentWriter = null;
+  private CharSequence currentPath = null;
+  private boolean closed = false;
+
+  public TargetedPositionDeleteWriter(
+      Supplier<FileWriter<PositionDelete<T>, DeleteWriteResult>> writers) {
+    this.writers = writers;
+    this.deleteFiles = Lists.newArrayList();
+    this.referencedDataFiles = CharSequenceSet.empty();
+  }
+
+  @Override
+  public void write(PositionDelete<T> positionDelete) {
+    writer(positionDelete.path()).write(positionDelete);
+  }
+
+  private FileWriter<PositionDelete<T>, DeleteWriteResult> writer(CharSequence path) {
+    if (currentWriter == null) {
+      openCurrentWriter(path);
+    } else if (unequal(currentPath, path)) {
+      closeCurrentWriter();
+      openCurrentWriter(path);
+    }
+
+    return currentWriter;
+  }
+
+  @Override
+  public long length() {
+    throw new UnsupportedOperationException(getClass().getName() + " does not implement length");

Review Comment:
   We don't need it because this writer wraps the rolling writer, not the other 
way around.






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439583443


##
core/src/main/java/org/apache/iceberg/TableProperties.java:
##
@@ -334,6 +335,9 @@ private TableProperties() {}
   public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
   public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;
 
+  public static final String DELETE_GRANULARITY = "write.delete.granularity";
+  public static final String DELETE_GRANULARITY_DEFAULT = DeleteGranularity.PARTITION.toString();

Review Comment:
   I started with a string constant but then saw what we did for 
`RowLevelOperationMode` and decided to follow that for consistency.
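
   For context, a minimal sketch (not from the thread) of how a table could opt 
into this setting, assuming the property constant and enum proposed in this 
diff together with the existing `Table.updateProperties()` API:

   ```java
   // Hedged sketch: switch a table to file-scoped position deletes.
   // `table` is an org.apache.iceberg.Table obtained from a catalog;
   // TableProperties.DELETE_GRANULARITY and DeleteGranularity.FILE are the
   // names proposed in this PR.
   table
       .updateProperties()
       .set(TableProperties.DELETE_GRANULARITY, DeleteGranularity.FILE.toString())
       .commit();
   ```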






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439592489


##
core/src/main/java/org/apache/iceberg/deletes/SortingPositionOnlyDeleteWriter.java:
##
@@ -60,7 +72,7 @@ public void write(PositionDelete<T> positionDelete) {
 
   @Override
   public long length() {
-return writer.length();
+return result != null ? result.totalFileSizeInBytes() : 0L;

Review Comment:
   I'll probably switch to not implementing it at all, just like we do in the 
other writer.






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439605409


##
core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileWriter;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.util.CharSequenceSet;
+
+/**
+ * A position delete writer that produces a separate delete file for each 
referenced data file.
+ *
+ * This writer does not keep track of seen deletes and assumes all incoming 
records are ordered
+ * by file and position as required by the spec. If there is no external 
process to order the
+ * records, consider using {@link SortingPositionOnlyDeleteWriter} instead.
+ */
+public class TargetedPositionDeleteWriter<T>

Review Comment:
   I believe `Clustered` is something we use for `PartitioningWriter` 
implementations to indicate that incoming records are grouped by spec and 
partition. If we use that prefix in this context, it may be a bit misleading.
   
   I renamed this class to `FileScopedPositionDeleteWriter`. Let me know what 
you think.






Re: [PR] Spark 3.5: Support filtering with buckets in RewriteDataFilesProcedure [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on PR #9396:
URL: https://github.com/apache/iceberg/pull/9396#issuecomment-1874242699

   Why does this need a new API?





Re: [PR] Core, Spark: Correct the delete record count for PartitionTable [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer merged PR #9389:
URL: https://github.com/apache/iceberg/pull/9389





Re: [PR] Core, Spark: Correct the delete record count for PartitionTable [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on PR #9389:
URL: https://github.com/apache/iceberg/pull/9389#issuecomment-1874247332

   LGTM! Thanks @ConeyLiu for the PR and @singhpk234 for the review





Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439622150


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data

Review Comment:
   "are allowed"? Perhaps maybe we should say something like "are directed to 
group deletes". I think the text in this doc goes a bit back and forth between 
saying that the delete writers will do something and the delete writers may do 
something.
   
   I think it may also help to kind of express these as (Many data files -> One 
Delete file) and (One data file -> One Delete File) or something like that?






Re: [PR] Spark Streaming: Fix clobbering of files across streaming epochs [iceberg]

2024-01-02 Thread via GitHub


rdblue merged PR #9255:
URL: https://github.com/apache/iceberg/pull/9255





Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]

2024-01-02 Thread via GitHub


rdblue closed issue #9172: Parquet file overwritten by spark streaming job in 
subsequent execution with same spark streaming checkpoint location
URL: https://github.com/apache/iceberg/issues/9172





Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2024-01-02 Thread via GitHub


rdblue closed issue #8953: Duplicate file name in Iceberg's metadata
URL: https://github.com/apache/iceberg/issues/8953





Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439631070


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data
+ * files into one delete file. This strategy tends to reduce the total number 
of delete files in the
+ * table. However, it may lead to the assignment of irrelevant deletes to some 
data files during the

Review Comment:
   Potential rewrite? Trying to make this a bit more directly worded:
   
   However, a scan of a single data file will require reading delete 
information for multiple data files in the partition, even if those other 
files are not required for the scan. This information is ignored during 
reads, but reading the extra delete information adds overhead. The overhead 
can potentially be mitigated via delete file caching (link here?).






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439633400


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data
+ * files into one delete file. This strategy tends to reduce the total number 
of delete files in the
+ * table. However, it may lead to the assignment of irrelevant deletes to some 
data files during the
+ * job planning. All irrelevant deletes are filtered by readers but add extra 
overhead, which can be
+ * mitigated via caching.
+ *
+ * Under file granularity, delete writers always organize deletes by their 
target data file,
+ * creating separate delete files for each referenced data file. This strategy 
ensures the job
+ * planning does not assign irrelevant deletes to data files. However, it also 
increases the total

Review Comment:
   "to data files which means no possibly extranousious delete information will 
be read unlike in `partition` granularity"?









Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439633731


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data
+ * files into one delete file. This strategy tends to reduce the total number 
of delete files in the
+ * table. However, it may lead to the assignment of irrelevant deletes to some 
data files during the
+ * job planning. All irrelevant deletes are filtered by readers but add extra 
overhead, which can be
+ * mitigated via caching.
+ *
+ * Under file granularity, delete writers always organize deletes by their 
target data file,
+ * creating separate delete files for each referenced data file. This strategy 
ensures the job
+ * planning does not assign irrelevant deletes to data files. However, it also 
increases the total
+ * number of delete files in the table and may require a more aggressive 
approach for delete file
+ * compaction.
+ *
+ * Currently, this configuration is only applicable to position deletes.
+ *
+ * Each granularity has its own benefits and drawbacks and should be picked 
based on a use case.
+ * Despite the chosen granularity, regular delete compaction remains 
necessary. It is also possible

Review Comment:
   Despite -> Regardless of the
   
   or maybe
   "Regular delete compaction is still required regardless of which granularity 
is chosen."






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439642290


##
core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileWriter;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.util.CharSequenceSet;
+
+/**
+ * A position delete writer that produces a separate delete file for each 
referenced data file.
+ *
+ * This writer does not keep track of seen deletes and assumes all incoming 
records are ordered
+ * by file and position as required by the spec. If there is no external 
process to order the
+ * records, consider using {@link SortingPositionOnlyDeleteWriter} instead.
+ */
+public class TargetedPositionDeleteWriter<T>

Review Comment:
   I think FileScoped works, or if you want a whole new name, 
PerFilePositionDeleteFileWriter?






Re: [PR] Bump Nessie to 0.76.0 [iceberg]

2024-01-02 Thread via GitHub


jbonofre commented on PR #9398:
URL: https://github.com/apache/iceberg/pull/9398#issuecomment-1874300845

   LGTM. @rdblue @nastra I would like to include this in Iceberg 1.5.0. Thoughts?





Re: [PR] Bump Nessie to 0.76.0 [iceberg]

2024-01-02 Thread via GitHub


jbonofre commented on PR #9398:
URL: https://github.com/apache/iceberg/pull/9398#issuecomment-1874301447

   Maybe worth having a corresponding issue to populate the changelog/release 
notes. I will do it.





Re: [PR] Replace black by Ruff Formatter [iceberg-python]

2024-01-02 Thread via GitHub


rdblue commented on PR #127:
URL: https://github.com/apache/iceberg-python/pull/127#issuecomment-1874330644

   Looks good to me now. Thanks, @hussein-awala!





Re: [PR] Replace black by Ruff Formatter [iceberg-python]

2024-01-02 Thread via GitHub


hussein-awala commented on PR #127:
URL: https://github.com/apache/iceberg-python/pull/127#issuecomment-1874338838

   I just merged main and fixed the conflicts; it should be ready to merge.





Re: [PR] Data: Allow classes of different packages to implement DeleteFilter [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9352:
URL: https://github.com/apache/iceberg/pull/9352#discussion_r1439676088


##
core/src/main/java/org/apache/iceberg/util/DecimalUtil.java:
##
@@ -31,7 +31,7 @@ private DecimalUtil() {}
   public static byte[] toReusedFixLengthBytes(
   int precision, int scale, BigDecimal decimal, byte[] reuseBuf) {
 Preconditions.checkArgument(
-decimal.scale() == scale,
+decimal.scale() <= scale,

Review Comment:
   Is this safe? It reuses fixed-length bytes, but in this case we aren't using 
all the bytes.
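
   For context, a small illustration (not from the thread) of why the relaxed 
check deserves scrutiny: decimals with different scales have different 
unscaled bytes, so writing a lower-scale value into a buffer that readers 
decode at the declared scale changes the value unless it is rescaled first.

   ```java
   import java.math.BigDecimal;

   public class ScaleMismatch {
     public static void main(String[] args) {
       BigDecimal a = new BigDecimal("1.20"); // scale 2, unscaled value 120
       BigDecimal b = new BigDecimal("1.2");  // scale 1, unscaled value 12
       System.out.println(a.unscaledValue()); // 120
       System.out.println(b.unscaledValue()); // 12 -- decoded at scale 2 this reads as 0.12
       System.out.println(b.setScale(2).unscaledValue()); // 120 -- rescaling first is safe
     }
   }
   ```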






Re: [PR] Data: Allow classes of different packages to implement DeleteFilter [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9352:
URL: https://github.com/apache/iceberg/pull/9352#discussion_r1439676343


##
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##
@@ -138,7 +138,7 @@ public void incrementDeleteCount() {
 counter.increment();
   }
 
-  Accessor<StructLike> posAccessor() {
+  public Accessor<StructLike> posAccessor() {

Review Comment:
   Why do we want to open this up?






Re: [PR] Data: Allow classes of different packages to implement DeleteFilter [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on PR #9352:
URL: https://github.com/apache/iceberg/pull/9352#issuecomment-1874346597

   This seems like two different issues being solved in one PR. Please 
separate them. I think you also need to offer a much more detailed explanation 
of why you want to make an API public.





Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439678214


##
core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.io.DeleteWriteResult;
+import org.apache.iceberg.io.FileWriter;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.util.CharSequenceSet;
+
+/**
+ * A position delete writer that produces a separate delete file for each 
referenced data file.
+ *
+ * This writer does not keep track of seen deletes and assumes all incoming 
records are ordered
+ * by file and position as required by the spec. If there is no external 
process to order the
+ * records, consider using {@link SortingPositionOnlyDeleteWriter} instead.
+ */
+public class TargetedPositionDeleteWriter<T>
+    implements FileWriter<PositionDelete<T>, DeleteWriteResult> {
+
+  private final Supplier<FileWriter<PositionDelete<T>, DeleteWriteResult>> writers;
+  private final List<DeleteFile> deleteFiles;
+  private final CharSequenceSet referencedDataFiles;
+
+  private FileWriter<PositionDelete<T>, DeleteWriteResult> currentWriter = null;
+  private CharSequence currentPath = null;
+  private boolean closed = false;
+
+  public TargetedPositionDeleteWriter(
+      Supplier<FileWriter<PositionDelete<T>, DeleteWriteResult>> writers) {
+    this.writers = writers;
+    this.deleteFiles = Lists.newArrayList();
+    this.referencedDataFiles = CharSequenceSet.empty();
+  }
+
+  @Override
+  public void write(PositionDelete<T> positionDelete) {
+    writer(positionDelete.path()).write(positionDelete);
+  }
+
+  private FileWriter<PositionDelete<T>, DeleteWriteResult> writer(CharSequence path) {
+    if (currentWriter == null) {
+      openCurrentWriter(path);
+    } else if (unequal(currentPath, path)) {
+      closeCurrentWriter();
+      openCurrentWriter(path);
+    }
+
+    return currentWriter;
+  }
+
+  @Override
+  public long length() {
+    throw new UnsupportedOperationException(getClass().getName() + " does not implement length");
+  }
+
+  @Override
+  public DeleteWriteResult result() {
+Preconditions.checkState(closed, "Cannot get result from unclosed writer");
+return new DeleteWriteResult(deleteFiles, referencedDataFiles);
+  }
+
+  @Override
+  public void close() throws IOException {
+if (!closed) {
+  closeCurrentWriter();
+  this.closed = true;
+}
+  }
+
+  private void openCurrentWriter(CharSequence path) {
+Preconditions.checkState(!closed, "Writer has already been closed");
+this.currentWriter = writers.get();
+this.currentPath = path;
+  }
+
+  private void closeCurrentWriter() {
+if (currentWriter != null) {
+  try {
+currentWriter.close();
+DeleteWriteResult result = currentWriter.result();
+deleteFiles.addAll(result.deleteFiles());
+referencedDataFiles.addAll(result.referencedDataFiles());
+this.currentWriter = null;
+this.currentPath = null;
+  } catch (IOException e) {
+throw new UncheckedIOException("Failed to close current writer", e);
+  }
+}
+  }
+
+  private static boolean unequal(CharSequence s1, CharSequence s2) {

Review Comment:
   Our `CharSeqComparator` is private. I've put this into a new utility class.
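
   For context, a minimal sketch (an assumption, not the PR's actual utility) 
of the char-by-char inequality check a helper like `unequal` needs, since 
`equals` across different `CharSequence` implementations (e.g. `String` vs 
`StringBuilder`) is always false even for identical content:

   ```java
   // Hedged sketch of a CharSequence inequality helper.
   static boolean unequal(CharSequence s1, CharSequence s2) {
     if (s1 == s2) {
       return false; // same reference (also covers both null)
     }
     if (s1 == null || s2 == null || s1.length() != s2.length()) {
       return true;
     }
     for (int i = 0; i < s1.length(); i++) {
       if (s1.charAt(i) != s2.charAt(i)) {
         return true;
       }
     }
     return false;
   }
   ```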






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439678391


##
core/src/main/java/org/apache/iceberg/TableProperties.java:
##
@@ -334,6 +335,9 @@ private TableProperties() {}
   public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
   public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;
 
+  public static final String DELETE_GRANULARITY = "write.delete.granularity";

Review Comment:
   Added.






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#issuecomment-1874349288

   > One question:
   > Iceberg has the rewritePositionDeletesAction. Will this pr influence this 
action?
   
   @jerqi, yes, it will. There is a new test in 
`TestRewritePositionDeleteFilesAction`





Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439679439


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data

Review Comment:
   Makes sense.






Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439679682


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data
+ * files into one delete file. This strategy tends to reduce the total number 
of delete files in the
+ * table. However, it may lead to the assignment of irrelevant deletes to some 
data files during the

Review Comment:
   I like it, incorporated. 






Re: [PR] Bump Nessie to 0.76.0 [iceberg]

2024-01-02 Thread via GitHub


snazy commented on PR #9398:
URL: https://github.com/apache/iceberg/pull/9398#issuecomment-1874350376

   The build issue (fixed by the 2nd commit) is caused by a Java 21 class file, 
properly placed in `META-INF/versions/21/com/fasterxml/jackson/...`. There is 
not yet a shadow-plugin release that can handle Java 21, except a shadow 
plugin in a different fork. The "original" 
[shadow plugin](https://github.com/johnrengelman/shadow/) has been unmaintained 
since March/April 2023; there is [a fork](https://github.com/Goooler/shadow) 
with changes to support Java 21 and Gradle 8.5+.





Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439679853


##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data
+ * files into one delete file. This strategy tends to reduce the total number 
of delete files in the
+ * table. However, it may lead to the assignment of irrelevant deletes to some 
data files during the
+ * job planning. All irrelevant deletes are filtered by readers but add extra 
overhead, which can be
+ * mitigated via caching.
+ *
+ * Under file granularity, delete writers always organize deletes by their 
target data file,
+ * creating separate delete files for each referenced data file. This strategy 
ensures the job
+ * planning does not assign irrelevant deletes to data files. However, it also 
increases the total
+ * number of delete files in the table and may require a more aggressive 
approach for delete file
+ * compaction.
+ *
+ * Currently, this configuration is only applicable to position deletes.
+ *
+ * Each granularity has its own benefits and drawbacks and should be picked 
based on a use case.
+ * Despite the chosen granularity, regular delete compaction remains 
necessary. It is also possible

Review Comment:
   Switched.



##
core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * An enum that represents the granularity of deletes.
+ *
+ * Under partition granularity, delete writers are allowed to group deletes 
for different data
+ * files into one delete file. This strategy tends to reduce the total number 
of delete files in the
+ * table. However, it may lead to the assignment of irrelevant deletes to some 
data files during the
+ * job planning. All irrelevant deletes are filtered by readers but add extra 
overhead, which can be
+ * mitigated via caching.
+ *
+ * Under file granularity, delete writers always organize deletes by their 
target data file,
+ * creating separate delete files for each referenced data file. This strategy 
ensures the job
+ * planning does not assign irrelevant deletes to data files. However, it also 
increases the total

Review Comment:
   Rewrote this part as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix day partition transform result type [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9345:
URL: https://github.com/apache/iceberg/pull/9345#discussion_r1439682666


##
format/spec.md:
##
@@ -318,7 +318,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string` | Source type |
 | **`year`**| Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
 | **`month`**   | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
-| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
+| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `date` |

Review Comment:
   Agree with Fokko, this is a description of the transform, which takes a date 
or timestamp and produces an int. If you are implementing this transform and 
return a date, the implementation would be incorrect. Now if you implement this 
function and display the output as a different type (like the metadata table) 
that's fine. 



##
format/spec.md:
##
@@ -318,7 +318,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string` | Source type |
 | **`year`**| Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
 | **`month`**   | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
-| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
+| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `date` |

Review Comment:
   Agree with Fokko, this is a description of the transform, which takes a date 
or timestamp and produces an int. If you are implementing this transform and 
return a date, the implementation would be incorrect. Now if you implement this 
function and display the output as a different type (like the metadata table) 
that's fine. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Fix day partition transform result type [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #9345:
URL: https://github.com/apache/iceberg/pull/9345#discussion_r1439682666


##
format/spec.md:
##
@@ -318,7 +318,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string` | Source type |
 | **`year`**| Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
 | **`month`**   | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
-| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
+| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `date` |

Review Comment:
   Agree with Fokko, this is a description of the transform, which takes a date 
or timestamp and produces an int. If you are implementing this transform and 
return a date, the transform would be incorrect and incompatible with other 
Iceberg implementations. Now if you implement this function and display the 
output as a different type (like the metadata table) that's fine. 
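
   For intuition, a minimal self-contained sketch (plain `java.time`, not 
Iceberg's `Transform` API) of what "days from 1970-01-01 as an int" means:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DayTransformIntuition {
  public static void main(String[] args) {
    // the day transform's result is an ordinal: whole days since the Unix epoch
    long days = ChronoUnit.DAYS.between(LocalDate.EPOCH, LocalDate.of(2024, 1, 2));
    System.out.println(days); // prints 19724 -- an integer ordinal, not a date value
  }
}
```

   Engines may render that ordinal back as a date for display, which is the 
distinction being drawn here.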



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439685145


##
core/src/main/java/org/apache/iceberg/TableProperties.java:
##
@@ -334,6 +335,9 @@ private TableProperties() {}
   public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
   public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;
 
+  public static final String DELETE_GRANULARITY = "write.delete.granularity";

Review Comment:
   To be honest, I doubt we will ever support this property for equality 
deletes.
   
   In general, I do get that we may want to configure position and equality 
deletes differently. We can explore adding an extra namespace. I am still not 
sure this use case falls into that bucket.
   
   @rdblue @RussellSpitzer @zhongyujiang, thoughts? Do we want a prefix for 
this config to make it explicit that it only applies to position deletes? 
Currently, I only note that in the docs.
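
   For concreteness, a minimal sketch of setting the knob under discussion via 
the standard `Table` API; the property name comes from this PR, and the string 
values `file`/`partition` are assumed from the enum:

```java
import org.apache.iceberg.Table;

public class DeleteGranularityExample {
  // sketch only: opt a table into file-scoped position delete files
  static void enableFileGranularity(Table table) {
    table
        .updateProperties()
        .set("write.delete.granularity", "file") // assumed values: "file" or "partition"
        .commit();
  }
}
```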
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Add ManifestWrite benchmark [iceberg]

2024-01-02 Thread via GitHub


dramaticlly commented on code in PR #8637:
URL: https://github.com/apache/iceberg/pull/8637#discussion_r1439696239


##
core/src/jmh/java/org/apache/iceberg/ManifestWriteBenchmark.java:
##
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import java.util.Random;
+import java.util.UUID;
+import java.util.concurrent.TimeUnit;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.io.Files;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.TearDown;
+import org.openjdk.jmh.annotations.Threads;
+import org.openjdk.jmh.annotations.Timeout;
+
+/**
+ * A benchmark that evaluates the performance of writing manifest files
+ *
+ * To run this benchmark: 
+ *   ./gradlew :iceberg-core:jmh -PjmhIncludeRegex=ManifestWriteBenchmark
+ * 
+ */
+@Fork(1)
+@State(Scope.Benchmark)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.SingleShotTime)
+@Timeout(time = 5, timeUnit = TimeUnit.MINUTES)
+public class ManifestWriteBenchmark {
+
+  private static final int NUM_FILES = 10;
+  private static final int NUM_ROWS = 10;
+  private static final int NUM_COLS = 100;
+
+  private String baseDir;
+  private String manifestListFile;
+
+  private Metrics metrics;
+
+  @Setup
+  public void before() {
+Random random = new Random(System.currentTimeMillis());
+// Pre-create the metrics to avoid doing this in the benchmark itself
+metrics = randomMetrics(random);
+  }
+
+  @TearDown
+  public void after() {
+if (baseDir != null) {
+  FileUtils.deleteQuietly(new File(baseDir));
+  baseDir = null;
+}
+
+manifestListFile = null;
+  }
+
+  @State(Scope.Benchmark)
+  public static class BenchmarkState {
+@Param({"1", "2"})
+public int formatVersion;

Review Comment:
   ```suggestion
   int formatVersion;
   ```
   
   checkstyle does not like it, and I believe it does not need to be public if 
we always access it via the state object



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1439698943


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkExecutorCache.java:
##
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark;
+
+import com.github.benmanes.caffeine.cache.Cache;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import java.time.Duration;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Map;
+import java.util.function.Supplier;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Queues;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * An executor cache for optimizing tasks by reducing the computation and IO 
overhead.
+ *
+ * The cache is configurable and enabled through Spark SQL properties. Its 
key features include
+ * setting limits on the total cache size and maximum size for individual 
entries. Additionally, it
+ * implements automatic eviction of entries after a specified duration of 
inactivity. The cache will
+ * respect the SQL configuration valid at the time of initialization. All 
subsequent changes will
+ * have no effect.
+ *
+ * Usage pattern involves fetching data from the cache using a unique 
combination of execution ID
+ * and key. If the data is not present in the cache, it is computed using the 
provided supplier and
+ * stored in the cache, subject to the defined size constraints.
+ *
+ * Note that this class employs the singleton pattern to ensure only one 
cache exists per JVM.
+ *
+ * @see SparkUtil#executionId()
+ */
+public class SparkExecutorCache {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(SparkExecutorCache.class);
+
+  private static final SparkConfParser CONF_PARSER = new SparkConfParser();
+  private static final boolean CACHE_ENABLED = parseCacheEnabledConf();
+  private static final Duration TIMEOUT = parseTimeoutConf();
+  private static final long MAX_ENTRY_SIZE = parseMaxEntrySizeConf();
+  private static final long MAX_TOTAL_SIZE = parseMaxTotalSizeConf();
+  private static final String EXECUTOR_DESC = SparkUtil.executorDesc();
+
+  private static volatile SparkExecutorCache instance = null;
+
+  private final Map<String, Collection<String>> keysByExecutionId;
+  private volatile Cache cache = null;
+
+  private SparkExecutorCache() {
+this.keysByExecutionId = Collections.synchronizedMap(Maps.newHashMap());
+  }
+
+  public static SparkExecutorCache getOrCreate() {
+if (instance == null && CACHE_ENABLED) {
+  synchronized (SparkExecutorCache.class) {
+if (instance == null) {
+  SparkExecutorCache.instance = new SparkExecutorCache();
+}
+  }
+}
+
+return instance;
+  }
+
+  public static void cleanUp(String executionId) {
+if (instance != null) {

Review Comment:
   Should we throw an error if instance == null and cleanUp gets called?
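
   (For reference, the size caps and idle-time eviction described in the 
javadoc above map onto Caffeine roughly as below; the numbers and names are 
illustrative, not the PR's code.)

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

public class BoundedCacheSketch {
  // illustrative only: a cache bounded by total weight with idle-time eviction
  static Cache<String, byte[]> newCache() {
    return Caffeine.newBuilder()
        .expireAfterAccess(10, TimeUnit.MINUTES) // drop entries after inactivity
        .maximumWeight(128L * 1024 * 1024) // cap the total cache size in bytes
        .weigher((String key, byte[] value) -> value.length) // entry weight = payload size
        .build();
  }
}
```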



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-02 Thread via GitHub


RussellSpitzer commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1439702880


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkExecutorCache.java:
##
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark;
+
+import com.github.benmanes.caffeine.cache.Cache;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import java.time.Duration;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Map;
+import java.util.function.Supplier;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Queues;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * An executor cache for optimizing tasks by reducing the computation and IO 
overhead.
+ *
+ * The cache is configurable and enabled through Spark SQL properties. Its 
key features include
+ * setting limits on the total cache size and maximum size for individual 
entries. Additionally, it
+ * implements automatic eviction of entries after a specified duration of 
inactivity. The cache will
+ * respect the SQL configuration valid at the time of initialization. All 
subsequent changes will
+ * have no effect.
+ *
+ * Usage pattern involves fetching data from the cache using a unique 
combination of execution ID
+ * and key. If the data is not present in the cache, it is computed using the 
provided supplier and
+ * stored in the cache, subject to the defined size constraints.
+ *
+ * Note that this class employs the singleton pattern to ensure only one 
cache exists per JVM.
+ *
+ * @see SparkUtil#executionId()
+ */
+public class SparkExecutorCache {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(SparkExecutorCache.class);
+
+  private static final SparkConfParser CONF_PARSER = new SparkConfParser();
+  private static final boolean CACHE_ENABLED = parseCacheEnabledConf();
+  private static final Duration TIMEOUT = parseTimeoutConf();
+  private static final long MAX_ENTRY_SIZE = parseMaxEntrySizeConf();
+  private static final long MAX_TOTAL_SIZE = parseMaxTotalSizeConf();
+  private static final String EXECUTOR_DESC = SparkUtil.executorDesc();
+
+  private static volatile SparkExecutorCache instance = null;
+
+  private final Map<String, Collection<String>> keysByExecutionId;
+  private volatile Cache cache = null;
+
+  private SparkExecutorCache() {
+this.keysByExecutionId = Collections.synchronizedMap(Maps.newHashMap());
+  }
+
+  public static SparkExecutorCache getOrCreate() {
+if (instance == null && CACHE_ENABLED) {
+  synchronized (SparkExecutorCache.class) {
+if (instance == null) {
+  SparkExecutorCache.instance = new SparkExecutorCache();
+}
+  }
+}
+
+return instance;
+  }
+
+  public static void cleanUp(String executionId) {
+if (instance != null) {
+  instance.invalidate(executionId);
+}
+  }
+
+  public long maxEntrySize() {
+return MAX_ENTRY_SIZE;
+  }
+
+  public <V> V get(String executionId, String key, Supplier<V> valueSupplier, long valueSize) {
+if (valueSize > MAX_ENTRY_SIZE) {

Review Comment:
   I'm a little confused here: if I try to get the value for this executionId 
and key, and it is already set in the cache, I may still error out.
   
   For example:
   
   get(id_1, key_1) - with 64mb cache entry // inits cache
   get(id_1, key_1) - with same 64mb cache entry // fails although the value is 
already set
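
   For reference, a minimal sketch (not the PR's code) of a lookup-first flow 
that would avoid the failure described above, because the size guard only 
applies to values that are not cached yet:

```java
import com.github.benmanes.caffeine.cache.Cache;
import java.util.function.Supplier;

public class GetOrComputeSketch {
  // sketch: check the cache before applying the max-entry-size guard
  static <V> V getOrCompute(
      Cache<String, V> cache, String key, Supplier<V> supplier, long valueSize, long maxSize) {
    V cached = cache.getIfPresent(key);
    if (cached != null) {
      return cached; // already cached: the size guard no longer matters
    }

    V value = supplier.get();
    if (valueSize <= maxSize) {
      cache.put(key, value); // only cache entries within the size limit
    }

    return value;
  }
}
```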



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439722314


##
core/src/main/java/org/apache/iceberg/TableProperties.java:
##
@@ -334,6 +335,9 @@ private TableProperties() {}
   public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
   public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;
 
+  public static final String DELETE_GRANULARITY = "write.delete.granularity";

Review Comment:
   This option makes no sense for equality deletes because they aren't targeted 
at a single file, so I agree that we won't support it for equality. This is 
also mostly advisory. It is unlikely that we will support it in Flink and will 
instead always use file-level granularity. Maybe we won't even want to support 
this in the long term, if we decide that Spark performs better with file 
granularity at all times.
   
   I guess where I'm at for this is that I would probably not worry much about 
it -- but also not add it to documentation since we will probably not want 
people setting it themselves. I think I'd leave it as 
`write.delete.granularity`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Bump pytest from 7.4.3 to 7.4.4 [iceberg-python]

2024-01-02 Thread via GitHub


Fokko merged PR #248:
URL: https://github.com/apache/iceberg-python/pull/248


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-02 Thread via GitHub


jackye1995 commented on code in PR #9385:
URL: https://github.com/apache/iceberg/pull/9385#discussion_r1439793304


##
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java:
##
@@ -749,4 +795,23 @@ public <T extends S3ClientBuilder> void applyEndpointConfigurations(T builder) {
   builder.endpointOverride(URI.create(endpoint));
 }
   }
+
+  /**
+   * Add the S3 Access Grants Plugin for an S3 client.
+   *
+   * Sample usage:
+   *
+   * 
+   * 
S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations)
+   * 
+   */
+  public <T extends S3ClientBuilder> void applyS3AccessGrantsConfigurations(T builder) {
+if (isS3AccessGrantsEnabled) {
+  S3AccessGrantsPlugin s3AccessGrantsPlugin =

Review Comment:
   > but while compiling the iceberg-aws module, you'll still need both 
ApacheHttpClient and UrlConnectionHttpClient classes.
   
   We are talking about a runtime dependency on a package, not compile time. 
Yes, the compile-time dependency is required. 2MB is still significant in my 
mind, since we are really trying to minimize bundle size as much as possible. 
But the more important issue is that people who use the current AWS bundle 
would get a failure unless they update their stack to include a plugin they do 
not use, which is going to be a blocker for upgrades.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-02 Thread via GitHub


jackye1995 commented on code in PR #9385:
URL: https://github.com/apache/iceberg/pull/9385#discussion_r1437796852


##
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java:
##
@@ -749,4 +795,23 @@ public <T extends S3ClientBuilder> void applyEndpointConfigurations(T builder) {
   builder.endpointOverride(URI.create(endpoint));
 }
   }
+
+  /**
+   * Add the S3 Access Grants Plugin for an S3 client.
+   *
+   * Sample usage:
+   *
+   * 
+   * 
S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations)
+   * 
+   */
+  public <T extends S3ClientBuilder> void applyS3AccessGrantsConfigurations(T builder) {
+if (isS3AccessGrantsEnabled) {
+  S3AccessGrantsPlugin s3AccessGrantsPlugin =

Review Comment:
   I think there is an issue we need to handle here. This will force users to 
have a dependency on the S3AccessGrantsPlugin library, even if they do not need 
this feature. To circumvent this, you can take a look at how we do that for 
things like HTTP client library dependencies: 
https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/ApacheHttpClientConfigurations.java.
 Moving all dependencies of a feature into a separate class avoids the 
runtime dependency.
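
   The pattern being pointed to looks roughly like the sketch below; 
`S3AccessGrantsPluginConfigurations` is a hypothetical helper name, used here 
only to illustrate confining the optional dependency to one class that the JVM 
resolves lazily:

```java
import software.amazon.awssdk.services.s3.S3ClientBuilder;

// hypothetical helper: the only class that would reference the optional plugin
class S3AccessGrantsPluginConfigurations {
  static <T extends S3ClientBuilder> void apply(T builder) {
    // in the real pattern this method would attach the plugin, e.g.
    //   builder.addPlugin(S3AccessGrantsPlugin.builder().build());
    // keeping that call here confines the plugin dependency to this class
  }
}

class S3FileIOPropertiesSketch {
  private final boolean isS3AccessGrantsEnabled;

  S3FileIOPropertiesSketch(boolean enabled) {
    this.isS3AccessGrantsEnabled = enabled;
  }

  <T extends S3ClientBuilder> void applyS3AccessGrantsConfigurations(T builder) {
    if (isS3AccessGrantsEnabled) {
      // the helper class is only resolved when the feature is on, so users
      // without the plugin jar never hit a NoClassDefFoundError
      S3AccessGrantsPluginConfigurations.apply(builder);
    }
  }
}
```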



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-02 Thread via GitHub


jackye1995 commented on code in PR #9385:
URL: https://github.com/apache/iceberg/pull/9385#discussion_r1439793304


##
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java:
##
@@ -749,4 +795,23 @@ public <T extends S3ClientBuilder> void applyEndpointConfigurations(T builder) {
   builder.endpointOverride(URI.create(endpoint));
 }
   }
+
+  /**
+   * Add the S3 Access Grants Plugin for an S3 client.
+   *
+   * Sample usage:
+   *
+   * 
+   * 
S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations)
+   * 
+   */
+  public <T extends S3ClientBuilder> void applyS3AccessGrantsConfigurations(T builder) {
+if (isS3AccessGrantsEnabled) {
+  S3AccessGrantsPlugin s3AccessGrantsPlugin =

Review Comment:
   > but while compiling the iceberg-aws module, you'll still need both 
ApacheHttpClient and UrlConnectionHttpClient classes.
   
   Sorry, my typo: we are talking about a runtime dependency on a package, not 
compile time. Yes, the compile-time dependency is required. 
   
   2MB is still significant in my mind, since we are really trying to minimize 
bundle size as much as possible. But the more important issue is that people 
who use the current AWS bundle would get a failure unless they update their 
stack to include a plugin they do not use, which is going to be a blocker for 
upgrades.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439797790


##
core/src/main/java/org/apache/iceberg/TableProperties.java:
##
@@ -334,6 +335,9 @@ private TableProperties() {}
   public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
   public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;
 
+  public static final String DELETE_GRANULARITY = "write.delete.granularity";

Review Comment:
   The idea not to document it for now is reasonable given that it acts like a 
recommendation and we are not sure we want to support it in the future. Let's 
keep the name as is then.
   
   Adding a way to configure position and equality deletes separately is 
another discussion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on code in PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439799553


##
core/src/main/java/org/apache/iceberg/TableProperties.java:
##
@@ -334,6 +335,9 @@ private TableProperties() {}
   public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms";
   public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE;
 
+  public static final String DELETE_GRANULARITY = "write.delete.granularity";

Review Comment:
   Reverted based on 
[this](https://github.com/apache/iceberg/pull/9384/files#r1436606249) 
discussion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.3, 3.4: Backport #9255 - Fix clobbering of files across epochs [iceberg]

2024-01-02 Thread via GitHub


rdblue merged PR #9399:
URL: https://github.com/apache/iceberg/pull/9399


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi merged PR #9384:
URL: https://github.com/apache/iceberg/pull/9384


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]

2024-01-02 Thread via GitHub


aokolnychyi commented on PR #9384:
URL: https://github.com/apache/iceberg/pull/9384#issuecomment-187452

   Thanks for reviewing, @jerqi @zinking @zhongyujiang @rdblue @RussellSpitzer!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Files are being overwritten on subsequent runs of Spark Structured Streaming [iceberg]

2024-01-02 Thread via GitHub


amogh-jahagirdar closed issue #8609: Files are being overwritten on subsequent 
runs of Spark Structured Streaming
URL: https://github.com/apache/iceberg/issues/8609


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Files are being overwritten on subsequent runs of Spark Structured Streaming [iceberg]

2024-01-02 Thread via GitHub


amogh-jahagirdar commented on issue #8609:
URL: https://github.com/apache/iceberg/issues/8609#issuecomment-1874704535

   This issue should be resolved in https://github.com/apache/iceberg/pull/9255 
and https://github.com/apache/iceberg/pull/9399 (backports to Spark 3.3 and 
Spark 3.4).
   
   This change should be included in the 1.5 release. I'm going to close this 
issue for now; please feel free to reopen it in case you still have any doubts 
or questions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Rewrite manifest action can only split large manifest file into two manifests,instead of expected target size [iceberg]

2024-01-02 Thread via GitHub


github-actions[bot] closed issue #6891: Rewrite manifest action can only split 
large manifest file into two manifests,instead of expected target size
URL: https://github.com/apache/iceberg/issues/6891


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Rewrite manifest action can only split large manifest file into two manifests,instead of expected target size [iceberg]

2024-01-02 Thread via GitHub


github-actions[bot] commented on issue #6891:
URL: https://github.com/apache/iceberg/issues/6891#issuecomment-1874709553

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Renaming a table may conflict with the new table with old table name [iceberg]

2024-01-02 Thread via GitHub


github-actions[bot] commented on issue #6890:
URL: https://github.com/apache/iceberg/issues/6890#issuecomment-1874709577

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Renaming a table may conflict with the new table with old table name [iceberg]

2024-01-02 Thread via GitHub


github-actions[bot] closed issue #6890: Renaming a table may conflict with the 
new table with old table name
URL: https://github.com/apache/iceberg/issues/6890


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439975347


##
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkAppenderFactory.java:
##
@@ -161,7 +162,12 @@ private StructType lazyPosDeleteSparkType() {
   }
 
   @Override
-  public FileAppender<InternalRow> newAppender(OutputFile file, FileFormat fileFormat) {
+  public FileAppender<InternalRow> newAppender(OutputFile outputFile, FileFormat format) {
+    return newAppender(EncryptionUtil.plainAsEncryptedOutput(outputFile), format);
+  }
+
+  @Override
+  public FileAppender<InternalRow> newAppender(EncryptedOutputFile file, FileFormat fileFormat) {

Review Comment:
   Ah, I was expecting these changes in Spark 3.5. Feel free to move them 
there, or we can add them to Spark 3.4 and open follow up PRs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439977630


##
core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java:
##
@@ -67,7 +72,15 @@ public InputFile decrypt(EncryptedInputFile encrypted) {
   @Override
   public Iterable<InputFile> decrypt(Iterable<EncryptedInputFile> encrypted) {
 // Bulk decrypt is only applied to data files. Returning source input files for parquet.
-return Iterables.transform(encrypted, this::decrypt);
+if (nativeDataEncryption) {
+  return Iterables.transform(encrypted, this::getSourceFile);
+} else {
+  return Iterables.transform(encrypted, this::decrypt);
+}
+  }
+
+  private InputFile getSourceFile(EncryptedInputFile encryptedFile) {

Review Comment:
   Style: Iceberg method names should not include `get`. Instead, use a more 
helpful verb like `create` or `find`, or simply leave it out.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439979355


##
core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java:
##
@@ -41,7 +42,10 @@ public class StandardEncryptionManager implements 
EncryptionManager {
* @param kmsClient Client of KMS used to wrap/unwrap keys in envelope 
encryption
*/
   public StandardEncryptionManager(
-  String tableKeyId, int dataKeyLength, KeyManagementClient kmsClient) {
+  String tableKeyId,
+  int dataKeyLength,
+  KeyManagementClient kmsClient,
+  boolean nativeDataEncryption) {

Review Comment:
   I don't think that this should be passed in. The encryption manager needs to 
support files that use both native encryption (Parquet) and files that use AES 
GCM streams (Avro). There is no way to set this correctly because the behavior 
depends on the file type.
   
   Instead, this should _always_ call `decrypt` and the caller should know how 
to handle the resulting `InputFile`. If it is simply an `InputFile` then it 
should be used directly. If it is a `StandardDecryptedInputFile` then it should 
be used for Avro but trigger native decryption for Parquet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439979355


##
core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java:
##
@@ -41,7 +42,10 @@ public class StandardEncryptionManager implements 
EncryptionManager {
* @param kmsClient Client of KMS used to wrap/unwrap keys in envelope 
encryption
*/
   public StandardEncryptionManager(
-  String tableKeyId, int dataKeyLength, KeyManagementClient kmsClient) {
+  String tableKeyId,
+  int dataKeyLength,
+  KeyManagementClient kmsClient,
+  boolean nativeDataEncryption) {

Review Comment:
   I don't think that this should be passed in. The encryption manager needs to 
support files that use both native encryption (Parquet) and files that use AES 
GCM streams (Avro). There is no way to set this correctly because the behavior 
depends on the file type.
   
   Instead, the caller should decide whether to call `decrypt` based on whether 
it is needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-02 Thread via GitHub


jackye1995 commented on code in PR #9385:
URL: https://github.com/apache/iceberg/pull/9385#discussion_r1439983680


##
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java:
##
@@ -749,4 +795,23 @@ public <T extends S3ClientBuilder> void applyEndpointConfigurations(T builder) {
   builder.endpointOverride(URI.create(endpoint));
 }
   }
+
+  /**
+   * Add the S3 Access Grants Plugin for an S3 client.
+   *
+   * Sample usage:
+   *
+   * 
+   * 
S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations)
+   * 
+   */
+  public <T extends S3ClientBuilder> void applyS3AccessGrantsConfigurations(T builder) {
+if (isS3AccessGrantsEnabled) {
+  S3AccessGrantsPlugin s3AccessGrantsPlugin =

Review Comment:
   None of the AWS dependencies are packaged together with the iceberg-aws 
module: https://github.com/apache/iceberg/blob/main/build.gradle#L467-L476. 
This is because users typically manage their own AWS SDK version, which is very 
likely different from the one specified by this project.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439979355


##
core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java:
##
@@ -41,7 +42,10 @@ public class StandardEncryptionManager implements 
EncryptionManager {
* @param kmsClient Client of KMS used to wrap/unwrap keys in envelope 
encryption
*/
   public StandardEncryptionManager(
-  String tableKeyId, int dataKeyLength, KeyManagementClient kmsClient) {
+  String tableKeyId,
+  int dataKeyLength,
+  KeyManagementClient kmsClient,
+  boolean nativeDataEncryption) {

Review Comment:
   I don't think that this should be passed in. The encryption manager needs to 
support files that use both native encryption (Parquet) and files that use AES 
GCM streams (Avro). There is no way to set this correctly because the behavior 
depends on the file type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439988090


##
core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java:
##
@@ -67,7 +72,15 @@ public InputFile decrypt(EncryptedInputFile encrypted) {
   @Override
   public Iterable<InputFile> decrypt(Iterable<EncryptedInputFile> encrypted) {
 // Bulk decrypt is only applied to data files. Returning source input files for parquet.
-return Iterables.transform(encrypted, this::decrypt);
+if (nativeDataEncryption) {

Review Comment:
   There is no way to know the intended file format or whether it will use 
native encryption at this point. I think this needs to always return the result 
of `decrypt`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Close the MetricsReporter when the Catalog is closed. [iceberg]

2024-01-02 Thread via GitHub


dramaticlly commented on code in PR #9353:
URL: https://github.com/apache/iceberg/pull/9353#discussion_r1439989287


##
aws/src/main/java/org/apache/iceberg/aws/dynamodb/DynamoDbCatalog.java:
##
@@ -487,6 +486,7 @@ public Configuration getConf() {
 
   @Override
   public void close() throws IOException {
+super.close();

Review Comment:
   I think that in order to close the metrics reporter when closing the 
catalog, we could instead add `super.metricsReporter()` to the closeableGroup 
on line 143, but this requires changing the 
`BaseMetastoreCatalog.metricsReporter()` method from private to protected. 
   
   This approach keeps the `Closeable` handling in the concrete class instead 
of the abstract base class, and it seems more aligned with how other closeables 
are handled in this class.
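
   A minimal sketch of that suggestion, assuming `metricsReporter()` is made 
`protected` (the registration lines here are illustrative, not the actual 
initialization code):

```java
// inside DynamoDbCatalog.initialize(...), alongside the existing registrations:
this.closeableGroup = new CloseableGroup();
closeableGroup.addCloseable(dynamo); // existing client registration
closeableGroup.addCloseable(metricsReporter()); // reporter now closed with the catalog
closeableGroup.setSuppressCloseFailure(true);
```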



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-02 Thread via GitHub


rdblue commented on PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#issuecomment-187478

   @ggershinsky, I'm still working through this review, so don't feel like you 
need to address or respond to my comments yet! Also, here are a few notes for 
myself when I pick this up tomorrow:
   
   * How do we ensure that all files that are committed to a table are 
encrypted? I think we should have a validation in `FastAppend` and 
`MergingSnapshotProducer` that verifies the encryption key metadata is non-null 
if the standard encryption manager is used.
   * There are places like `BaseTaskWriter.openCurrent` that also call 
`EncryptedOutputFile.encryptingOutputFile()` to get the location. We should 
look into whether those should use the underlying file path or need to be 
updated.
   * I think the way to solve the problem with 
`StandardEncryptionManager.decrypt` is to make `StandardDecryptedInputFile` 
implement both `InputFile` (that runs AES GCM decryption) and 
`EncryptedInputFile` (to provide access to key metadata and the encrypted 
underlying `InputFile`). Then the read path continues to pass around 
`InputFile`, but can check whether the file can be used via 
`EncryptedInputFile` for native decryption.
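
   A hedged sketch of the third bullet's idea (an assumed class shape, 
delegating everywhere; a real implementation would wrap `newStream()` in AES 
GCM decryption and adjust `getLength()` for the cipher overhead):

```java
import org.apache.iceberg.encryption.EncryptedInputFile;
import org.apache.iceberg.encryption.EncryptionKeyMetadata;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.SeekableInputStream;

// sketch: a decrypted view that also exposes the raw encrypted file, so
// Parquet can opt into native decryption while Avro uses the wrapped stream
class StandardDecryptedInputFile implements InputFile, EncryptedInputFile {
  private final InputFile encrypted;
  private final EncryptionKeyMetadata keyMetadata;

  StandardDecryptedInputFile(InputFile encrypted, EncryptionKeyMetadata keyMetadata) {
    this.encrypted = encrypted;
    this.keyMetadata = keyMetadata;
  }

  @Override
  public long getLength() {
    return encrypted.getLength(); // a real impl would subtract AES GCM overhead
  }

  @Override
  public SeekableInputStream newStream() {
    return encrypted.newStream(); // a real impl would wrap this in GCM decryption
  }

  @Override
  public String location() {
    return encrypted.location();
  }

  @Override
  public boolean exists() {
    return encrypted.exists();
  }

  @Override
  public InputFile encryptedInputFile() {
    return encrypted; // native readers (Parquet) use this plus keyMetadata()
  }

  @Override
  public EncryptionKeyMetadata keyMetadata() {
    return keyMetadata;
  }
}
```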


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Close the MetricsReporter when the Catalog is closed. [iceberg]

2024-01-02 Thread via GitHub


huyuanfeng2018 commented on code in PR #9353:
URL: https://github.com/apache/iceberg/pull/9353#discussion_r1440003882


##
aws/src/main/java/org/apache/iceberg/aws/dynamodb/DynamoDbCatalog.java:
##
@@ -487,6 +486,7 @@ public Configuration getConf() {
 
   @Override
   public void close() throws IOException {
+super.close();

Review Comment:
   > I think that in order to close the metrics reporter when closing the 
catalog, we could instead add `super.metricsReporter()` to the closeableGroup 
on line 143, but this requires changing the 
`BaseMetastoreCatalog.metricsReporter()` method from private to protected.
   > 
   > This approach keeps the `Closeable` handling in the concrete class instead 
of the abstract base class, and it seems more aligned with how other closeables 
are handled in this class.
   
   I agree with this statement. In Java, it is common practice to call 
`super.close()` when a subclass overrides a method of the parent class, which 
is why I did it this way at first. If we want to unify the `Closeable` 
handling, it is indeed better to use the `closeableGroup`. I will modify this 
part of the code.
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-02 Thread via GitHub


szehon-ho commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1433435467


##
core/src/main/java/org/apache/iceberg/deletes/PositionDeleteIndex.java:
##
@@ -44,4 +44,14 @@ public interface PositionDeleteIndex {
 
   /** Returns true if this collection contains no element. */
   boolean isEmpty();
+
+  /** Returns true if this collection contains elements. */
+  default boolean isNotEmpty() {

Review Comment:
   is this really needed? (versus `!isEmpty()`?)



##
core/src/main/java/org/apache/iceberg/deletes/PositionDeleteIndex.java:
##
@@ -44,4 +44,14 @@ public interface PositionDeleteIndex {
 
   /** Returns true if this collection contains no element. */
   boolean isEmpty();
+
+  /** Returns true if this collection contains elements. */
+  default boolean isNotEmpty() {

Review Comment:
   is this really needed? 
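
   For reference, the default body under review is presumably just the negation of isEmpty() (a sketch, not necessarily the PR's exact code):

      default boolean isNotEmpty() {
        return !isEmpty();
      }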



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-02 Thread via GitHub


adnanhemani commented on code in PR #9385:
URL: https://github.com/apache/iceberg/pull/9385#discussion_r1440008488


##
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java:
##
@@ -749,4 +795,23 @@ public <T extends S3ClientBuilder> void applyEndpointConfigurations(T builder) {
   builder.endpointOverride(URI.create(endpoint));
 }
   }
+
+  /**
+   * Add the S3 Access Grants Plugin for an S3 client.
+   *
+   * <p>Sample usage:
+   *
+   * <pre>
+   * S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations)
+   * </pre>
+   */
+  public <T extends S3ClientBuilder> void applyS3AccessGrantsConfigurations(T builder) {
+    if (isS3AccessGrantsEnabled) {
+      S3AccessGrantsPlugin s3AccessGrantsPlugin =

Review Comment:
   I don't think we're on the same page about this - the S3 Access Grants plugin code is separate from the rest of the AWS SDK and can therefore be treated like any other third-party dependency. But this is a moot point with the new change I've introduced.
   
   The size increase after these new changes is just 2.2KB (likely due to the source code itself at this point). Please let me know if these changes are sufficient or if any other changes need to be made. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Support filtering with buckets in RewriteDataFilesProcedure [iceberg]

2024-01-02 Thread via GitHub


manuzhang commented on PR #9396:
URL: https://github.com/apache/iceberg/pull/9396#issuecomment-1874770664

   @RussellSpitzer could you please give an example of passing a partition (esp. a bucket) via a filter expression?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] how to integrations object storage ceph ? [iceberg]

2024-01-02 Thread via GitHub


hchautrung commented on issue #7158:
URL: https://github.com/apache/iceberg/issues/7158#issuecomment-1874771632

   Hi @jkl0898,
   
   I am looking for a solution to use Ceph with Iceberg. Currently I use MinIO, but we are looking for an alternative to replace it. Could you share technical details on how to configure Ceph with Iceberg? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Spark: Correct the delete record count for PartitionTable [iceberg]

2024-01-02 Thread via GitHub


ConeyLiu commented on PR #9389:
URL: https://github.com/apache/iceberg/pull/9389#issuecomment-1874773959

   Thanks @singhpk234 @RussellSpitzer @dramaticlly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-02 Thread via GitHub


szehon-ho commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1434014371


##
data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java:
##
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.function.Function;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.data.avro.DataReader;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.parquet.GenericParquetReaders;
+import org.apache.iceberg.deletes.Deletes;
+import org.apache.iceberg.deletes.PositionDeleteIndex;
+import org.apache.iceberg.deletes.PositionDeleteIndexUtil;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.DeleteSchemaUtil;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.orc.OrcRowReader;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.parquet.ParquetValueReader;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.math.LongMath;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.util.CharSequenceMap;
+import org.apache.iceberg.util.StructLikeSet;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.apache.orc.TypeDescription;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class BaseDeleteLoader implements DeleteLoader {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(BaseDeleteLoader.class);
+  private static final Schema POS_DELETE_SCHEMA = 
DeleteSchemaUtil.pathPosSchema();
+
+  private final Function<DeleteFile, InputFile> loadInputFile;
+  private final ExecutorService workerPool;
+
+  public BaseDeleteLoader(Function<DeleteFile, InputFile> loadInputFile) {
+    this(loadInputFile, ThreadPools.getDeleteWorkerPool());
+  }
+
+  public BaseDeleteLoader(
+      Function<DeleteFile, InputFile> loadInputFile, ExecutorService workerPool) {
+    this.loadInputFile = loadInputFile;
+    this.workerPool = workerPool;
+  }
+
+  /**
+   * Checks if the given number of bytes can be cached.
+   *
+   * <p>Implementations should override this method if they support caching. It is also recommended
+   * to use the provided size as a guideline to decide whether the value is eligible for caching.
+   * For instance, it may be beneficial to discard values that are too large to optimize the cache
+   * performance and utilization.
+   */
+  protected boolean canCache(long size) {
+    return false;
+  }
+
+  /**
+   * Gets the cached value for the key or populates the cache with a new mapping.
+   *
+   * <p>If the value for the specified key is in the cache, it should be returned. If the value is
+   * not in the cache, implementations should compute the value using the provided supplier, cache
+   * it, and then return it.
+   *
+   * <p>This method will be called only if {@link #canCache(long)} returned true.
+   */
+  protected <V> V get(String key, Supplier<V> valueSupplier, long valueSize) {

Review Comment:
   get() is a bit confusing, can we call it getOrLoad (like on the calling 
method), or getOrPopulate()?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---

Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-02 Thread via GitHub


szehon-ho commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1434014371


##
data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java:
##
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.function.Function;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.data.avro.DataReader;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.parquet.GenericParquetReaders;
+import org.apache.iceberg.deletes.Deletes;
+import org.apache.iceberg.deletes.PositionDeleteIndex;
+import org.apache.iceberg.deletes.PositionDeleteIndexUtil;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.DeleteSchemaUtil;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.orc.OrcRowReader;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.parquet.ParquetValueReader;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.math.LongMath;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.util.CharSequenceMap;
+import org.apache.iceberg.util.StructLikeSet;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.apache.orc.TypeDescription;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class BaseDeleteLoader implements DeleteLoader {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(BaseDeleteLoader.class);
+  private static final Schema POS_DELETE_SCHEMA = 
DeleteSchemaUtil.pathPosSchema();
+
+  private final Function<DeleteFile, InputFile> loadInputFile;
+  private final ExecutorService workerPool;
+
+  public BaseDeleteLoader(Function<DeleteFile, InputFile> loadInputFile) {
+    this(loadInputFile, ThreadPools.getDeleteWorkerPool());
+  }
+
+  public BaseDeleteLoader(
+      Function<DeleteFile, InputFile> loadInputFile, ExecutorService workerPool) {
+    this.loadInputFile = loadInputFile;
+    this.workerPool = workerPool;
+  }
+
+  /**
+   * Checks if the given number of bytes can be cached.
+   *
+   * <p>Implementations should override this method if they support caching. It is also recommended
+   * to use the provided size as a guideline to decide whether the value is eligible for caching.
+   * For instance, it may be beneficial to discard values that are too large to optimize the cache
+   * performance and utilization.
+   */
+  protected boolean canCache(long size) {
+    return false;
+  }
+
+  /**
+   * Gets the cached value for the key or populates the cache with a new mapping.
+   *
+   * <p>If the value for the specified key is in the cache, it should be returned. If the value is
+   * not in the cache, implementations should compute the value using the provided supplier, cache
+   * it, and then return it.
+   *
+   * <p>This method will be called only if {@link #canCache(long)} returned true.
+   */
+  protected <V> V get(String key, Supplier<V> valueSupplier, long valueSize) {

Review Comment:
   get() is a bit confusing, can we call it getOrLoad (like on the calling 
method), or maybe better computeIfAbsent()?
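
   For illustration, a caching override along the lines being discussed could lean on computeIfAbsent (a hypothetical subclass sketch, not part of the PR):

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.function.Supplier;

      class CachingDeleteLoaderSketch {
        private final Map<String, Object> cache = new ConcurrentHashMap<>();

        // Mirrors the method under review; the suggested name computeIfAbsent
        // matches the java.util.Map API it delegates to. In a real override,
        // valueSize would gate admission via canCache(valueSize).
        @SuppressWarnings("unchecked")
        <V> V getOrLoad(String key, Supplier<V> valueSupplier, long valueSize) {
          return (V) cache.computeIfAbsent(key, k -> valueSupplier.get());
        }
      }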



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-02 Thread via GitHub


szehon-ho commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1434014371


##
data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java:
##
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.function.Function;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.data.avro.DataReader;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.parquet.GenericParquetReaders;
+import org.apache.iceberg.deletes.Deletes;
+import org.apache.iceberg.deletes.PositionDeleteIndex;
+import org.apache.iceberg.deletes.PositionDeleteIndexUtil;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.DeleteSchemaUtil;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.orc.OrcRowReader;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.parquet.ParquetValueReader;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.math.LongMath;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.util.CharSequenceMap;
+import org.apache.iceberg.util.StructLikeSet;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.apache.orc.TypeDescription;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class BaseDeleteLoader implements DeleteLoader {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(BaseDeleteLoader.class);
+  private static final Schema POS_DELETE_SCHEMA = 
DeleteSchemaUtil.pathPosSchema();
+
+  private final Function<DeleteFile, InputFile> loadInputFile;
+  private final ExecutorService workerPool;
+
+  public BaseDeleteLoader(Function<DeleteFile, InputFile> loadInputFile) {
+    this(loadInputFile, ThreadPools.getDeleteWorkerPool());
+  }
+
+  public BaseDeleteLoader(
+      Function<DeleteFile, InputFile> loadInputFile, ExecutorService workerPool) {
+    this.loadInputFile = loadInputFile;
+    this.workerPool = workerPool;
+  }
+
+  /**
+   * Checks if the given number of bytes can be cached.
+   *
+   * <p>Implementations should override this method if they support caching. It is also recommended
+   * to use the provided size as a guideline to decide whether the value is eligible for caching.
+   * For instance, it may be beneficial to discard values that are too large to optimize the cache
+   * performance and utilization.
+   */
+  protected boolean canCache(long size) {
+    return false;
+  }
+
+  /**
+   * Gets the cached value for the key or populates the cache with a new mapping.
+   *
+   * <p>If the value for the specified key is in the cache, it should be returned. If the value is
+   * not in the cache, implementations should compute the value using the provided supplier, cache
+   * it, and then return it.
+   *
+   * <p>This method will be called only if {@link #canCache(long)} returned true.
+   */
+  protected <V> V get(String key, Supplier<V> valueSupplier, long valueSize) {

Review Comment:
   get() is a bit confusing, can we call it getOrLoad (like on the calling 
method), or maybe better getOrDefault() as in Java map API (renaming the other 
arguments to be default)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[PR] Spark 3.5: Set log level to WARN for rewrite task failure with partial progress [iceberg]

2024-01-02 Thread via GitHub


manuzhang opened a new pull request, #9400:
URL: https://github.com/apache/iceberg/pull/9400

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Request distribution and ordering for writes [iceberg]

2024-01-02 Thread via GitHub


maytasm commented on PR #3461:
URL: https://github.com/apache/iceberg/pull/3461#issuecomment-1874846377

   If, in Spark 3.3, users no longer have to explicitly sort their data before INSERT into a partitioned table, should we insert a global sort into the plan rather than a local sort? 
https://github.com/apache/iceberg/pull/3461/files#diff-e13fee34cfc35da918a0a1a668758085888e87ba9eb36685121a4fad009fa649R74
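
   For the user-level distinction between the two (a sketch; the PR makes this decision inside the planner, not in user code):

      import static org.apache.spark.sql.functions.col;

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      // A global sort shuffles into range partitions before sorting;
      // a local sort only sorts the rows within each task.
      public class SortKinds {
        public static void main(String[] args) {
          SparkSession spark =
              SparkSession.builder().master("local[2]").appName("sorts").getOrCreate();
          Dataset<Row> df = spark.range(100).withColumn("part", col("id").mod(4));

          Dataset<Row> globalSort = df.orderBy("part");             // range shuffle + sort
          Dataset<Row> localSort = df.sortWithinPartitions("part"); // per-partition sort only

          globalSort.explain();
          localSort.explain();
          spark.stop();
        }
      }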


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Suppress exceptions in case of dropTableData [iceberg]

2024-01-02 Thread via GitHub


nk1506 commented on PR #9184:
URL: https://github.com/apache/iceberg/pull/9184#issuecomment-1874868435

   @Fokko, a gentle reminder about this one. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Ensure that partition stats files are considered for GC procedures [iceberg]

2024-01-02 Thread via GitHub


ajantha-bhat commented on PR #9284:
URL: https://github.com/apache/iceberg/pull/9284#issuecomment-1874869959

   ping @aokolnychyi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org