Re: [PR] Spark: Added merge schema as spark configuration [iceberg]
Aleena-M-Georgy closed pull request #9397: Spark: Added merge schema as spark configuration URL: https://github.com/apache/iceberg/pull/9397
Re: [I] Apache Flink not committing new snapshots to Iceberg Table [iceberg]
FranMorilloAWS commented on issue #9089: URL: https://github.com/apache/iceberg/issues/9089#issuecomment-1873872091 Any thoughts on why this could be happening?
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
vinitpatni commented on PR #9381: URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1874003109 - Adding Junit 5 conversion and AssertJ style for TestFlinkUpsert and TestRewriteDataFilesAction
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
vinitpatni commented on PR #9381: URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1874010083 @nastra TestStreamScanSql is the only subclass remaining for conversion to JUnit 5, but it has a dependency on the GenericAppenderHelper class, which is part of the iceberg-data module. Let me know how to tackle that?
Re: [I] Create JUnit5-version of FlinkCatalogTestBase [iceberg]
vinitpatni commented on issue #9079: URL: https://github.com/apache/iceberg/issues/9079#issuecomment-1874038839 The following is the active PR for this one: https://github.com/apache/iceberg/pull/9381
Re: [I] Schema IDs Re-Order? [iceberg-python]
sebpretzer closed issue #229: Schema IDs Re-Order? URL: https://github.com/apache/iceberg-python/issues/229
Re: [I] Schema IDs Re-Order? [iceberg-python]
sebpretzer commented on issue #229: URL: https://github.com/apache/iceberg-python/issues/229#issuecomment-1874187979 @Fokko To follow up, we have fixed this internally ourselves with something similar to below:
```python
from pydantic import BaseModel, field_validator
from pyiceberg.schema import Schema, assign_fresh_schema_ids


class FooBarModel(BaseModel):
    iceberg: Schema

    @field_validator("iceberg")
    def field_ids_in_preorder(cls, v: Schema) -> Schema:
        if v != assign_fresh_schema_ids(v):
            raise ValueError("Schema ids not in preorder")
        return v
```
It places a bit more burden on users to properly order the field ids, but the schema will no longer change underneath them. If you want something similar upstream in `Schema`, I am happy to take a crack at it.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439572133 ## core/src/main/java/org/apache/iceberg/io/ClusteredPositionDeleteWriter.java: ## @@ -46,17 +49,39 @@ public ClusteredPositionDeleteWriter( OutputFileFactory fileFactory, FileIO io, long targetFileSizeInBytes) { +this(writerFactory, fileFactory, io, targetFileSizeInBytes, DeleteGranularity.PARTITION); + } + + public ClusteredPositionDeleteWriter( + FileWriterFactory writerFactory, + OutputFileFactory fileFactory, + FileIO io, + long targetFileSizeInBytes, + DeleteGranularity granularity) { this.writerFactory = writerFactory; this.fileFactory = fileFactory; this.io = io; this.targetFileSizeInBytes = targetFileSizeInBytes; +this.granularity = granularity; this.deleteFiles = Lists.newArrayList(); this.referencedDataFiles = CharSequenceSet.empty(); } @Override protected FileWriter, DeleteWriteResult> newWriter( PartitionSpec spec, StructLike partition) { +switch (granularity) { + case FILE: +return new TargetedPositionDeleteWriter<>(() -> newRollingWriter(spec, partition)); Review Comment: We can actually roll correctly here because this is the "clustered" path. We are not going to use the sorting writer and will not buffer deletes. We can also roll correctly in the "fanout" path because the sorting writer acts as a wrapper on top of the rolling writer.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439581834 ## core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.List; +import java.util.function.Supplier; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.io.DeleteWriteResult; +import org.apache.iceberg.io.FileWriter; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.util.CharSequenceSet; + +/** + * A position delete writer that produces a separate delete file for each referenced data file. + * + * This writer does not keep track of seen deletes and assumes all incoming records are ordered + * by file and position as required by the spec. If there is no external process to order the + * records, consider using {@link SortingPositionOnlyDeleteWriter} instead. + */ +public class TargetedPositionDeleteWriter +implements FileWriter, DeleteWriteResult> { + + private final Supplier, DeleteWriteResult>> writers; + private final List deleteFiles; + private final CharSequenceSet referencedDataFiles; + + private FileWriter, DeleteWriteResult> currentWriter = null; + private CharSequence currentPath = null; + private boolean closed = false; + + public TargetedPositionDeleteWriter( + Supplier, DeleteWriteResult>> writers) { +this.writers = writers; +this.deleteFiles = Lists.newArrayList(); +this.referencedDataFiles = CharSequenceSet.empty(); + } + + @Override + public void write(PositionDelete positionDelete) { +writer(positionDelete.path()).write(positionDelete); + } + + private FileWriter, DeleteWriteResult> writer(CharSequence path) { +if (currentWriter == null) { + openCurrentWriter(path); +} else if (unequal(currentPath, path)) { + closeCurrentWriter(); + openCurrentWriter(path); +} + +return currentWriter; + } + + @Override + public long length() { +throw new UnsupportedOperationException(getClass().getName() + " does not implement length"); Review Comment: We don't need it because this writer wraps the rolling writer, not the other way around.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439583443 ## core/src/main/java/org/apache/iceberg/TableProperties.java: ## @@ -334,6 +335,9 @@ private TableProperties() {} public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms"; public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE; + public static final String DELETE_GRANULARITY = "write.delete.granularity"; + public static final String DELETE_GRANULARITY_DEFAULT = DeleteGranularity.PARTITION.toString(); Review Comment: I started with a string constant but then saw what we did for `RowLevelOperationMode` and decided to follow that for consistency.
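For readers following this thread, a minimal sketch of how such a property could be resolved on the write path, assuming an enum whose lower-case name matches the stored string (mirroring the `RowLevelOperationMode` pattern referenced above). The property name and default mirror the diff, but the enum and helper here are illustrative stand-ins, not the Iceberg API:
```java
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: resolves a delete-granularity setting from table properties.
public class DeleteGranularityResolver {
  // Property name and default taken from the diff above; the enum is a stand-in.
  static final String DELETE_GRANULARITY = "write.delete.granularity";
  static final String DELETE_GRANULARITY_DEFAULT = "partition";

  enum DeleteGranularity {
    FILE,
    PARTITION;

    @Override
    public String toString() {
      return name().toLowerCase(Locale.ROOT);
    }
  }

  static DeleteGranularity resolve(Map<String, String> tableProperties) {
    String value =
        tableProperties.getOrDefault(DELETE_GRANULARITY, DELETE_GRANULARITY_DEFAULT);
    return DeleteGranularity.valueOf(value.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(resolve(Map.of()));                           // PARTITION
    System.out.println(resolve(Map.of(DELETE_GRANULARITY, "file"))); // FILE
  }
}
```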
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439592489 ## core/src/main/java/org/apache/iceberg/deletes/SortingPositionOnlyDeleteWriter.java: ## @@ -60,7 +72,7 @@ public void write(PositionDelete positionDelete) { @Override public long length() { -return writer.length(); +return result != null ? result.totalFileSizeInBytes() : 0L; Review Comment: I'll probably switch to not implementing it at all, just like we do in the other writer.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439605409 ## core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.List; +import java.util.function.Supplier; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.io.DeleteWriteResult; +import org.apache.iceberg.io.FileWriter; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.util.CharSequenceSet; + +/** + * A position delete writer that produces a separate delete file for each referenced data file. + * + * This writer does not keep track of seen deletes and assumes all incoming records are ordered + * by file and position as required by the spec. If there is no external process to order the + * records, consider using {@link SortingPositionOnlyDeleteWriter} instead. + */ +public class TargetedPositionDeleteWriter Review Comment: I believe `Clustered` is something we use for `PartitioningWriter` implementations to indicate that incoming records are grouped by spec and partition. If we use that prefix in this context, it may be a bit misleading. I renamed this class to `FileScopedPositionDeleteWriter`. Let me know what you think.
Re: [PR] Spark 3.5: Support filtering with buckets in RewriteDataFilesProcedure [iceberg]
RussellSpitzer commented on PR #9396: URL: https://github.com/apache/iceberg/pull/9396#issuecomment-1874242699 Why does this need a new API?
Re: [PR] Core, Spark: Correct the delete record count for PartitionTable [iceberg]
RussellSpitzer merged PR #9389: URL: https://github.com/apache/iceberg/pull/9389
Re: [PR] Core, Spark: Correct the delete record count for PartitionTable [iceberg]
RussellSpitzer commented on PR #9389: URL: https://github.com/apache/iceberg/pull/9389#issuecomment-1874247332 LGTM! Thanks @ConeyLiu for the PR and @singhpk234 for the review
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
RussellSpitzer commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439622150 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data Review Comment: "are allowed"? Perhaps we should say something like "are directed to group deletes". I think the text in this doc goes a bit back and forth between saying that the delete writers will do something and the delete writers may do something. I think it may also help to express these as (Many data files -> One Delete file) and (One data file -> One Delete File) or something like that?
Re: [PR] Spark Streaming: Fix clobbering of files across streaming epochs [iceberg]
rdblue merged PR #9255: URL: https://github.com/apache/iceberg/pull/9255
Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]
rdblue closed issue #9172: Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location URL: https://github.com/apache/iceberg/issues/9172
Re: [I] Duplicate file name in Iceberg's metadata [iceberg]
rdblue closed issue #8953: Duplicate file name in Iceberg's metadata URL: https://github.com/apache/iceberg/issues/8953
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
RussellSpitzer commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439631070 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data + * files into one delete file. This strategy tends to reduce the total number of delete files in the + * table. However, it may lead to the assignment of irrelevant deletes to some data files during the Review Comment: Potential rewrite? Trying to make this a bit more directly worded: "However, a scan for a single data file will require reading delete information for multiple data files in the partition even if those other files are not required for the scan. This information will be ignored during the reads, but reading this extra delete information will cause overhead. The overhead can potentially be mitigated via delete file caching (link here?)."
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
RussellSpitzer commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439633400 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data + * files into one delete file. This strategy tends to reduce the total number of delete files in the + * table. However, it may lead to the assignment of irrelevant deletes to some data files during the + * job planning. All irrelevant deletes are filtered by readers but add extra overhead, which can be + * mitigated via caching. + * + * Under file granularity, delete writers always organize deletes by their target data file, + * creating separate delete files for each referenced data file. This strategy ensures the job + * planning does not assign irrelevant deletes to data files. However, it also increases the total Review Comment: "to data files, which means no possibly extraneous delete information will be read, unlike in `partition` granularity"?
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
RussellSpitzer commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439633731 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data + * files into one delete file. This strategy tends to reduce the total number of delete files in the + * table. However, it may lead to the assignment of irrelevant deletes to some data files during the + * job planning. All irrelevant deletes are filtered by readers but add extra overhead, which can be + * mitigated via caching. + * + * Under file granularity, delete writers always organize deletes by their target data file, + * creating separate delete files for each referenced data file. This strategy ensures the job + * planning does not assign irrelevant deletes to data files. However, it also increases the total + * number of delete files in the table and may require a more aggressive approach for delete file + * compaction. + * + * Currently, this configuration is only applicable to position deletes. + * + * Each granularity has its own benefits and drawbacks and should be picked based on a use case. + * Despite the chosen granularity, regular delete compaction remains necessary. It is also possible Review Comment: Despite -> Regardless of the, or maybe "Regular delete compaction is still required regardless of which granularity is chosen."
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
RussellSpitzer commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439642290 ## core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.List; +import java.util.function.Supplier; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.io.DeleteWriteResult; +import org.apache.iceberg.io.FileWriter; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.util.CharSequenceSet; + +/** + * A position delete writer that produces a separate delete file for each referenced data file. + * + * This writer does not keep track of seen deletes and assumes all incoming records are ordered + * by file and position as required by the spec. If there is no external process to order the + * records, consider using {@link SortingPositionOnlyDeleteWriter} instead. + */ +public class TargetedPositionDeleteWriter Review Comment: I think FileScoped, or if you want a whole new name, PerFilePositionDeleteFileWriter?
Re: [PR] Bump Nessie to 0.76.0 [iceberg]
jbonofre commented on PR #9398: URL: https://github.com/apache/iceberg/pull/9398#issuecomment-1874300845 LGTM. @rdblue @nastra I would like to include this in Iceberg 1.5.0. Thoughts?
Re: [PR] Bump Nessie to 0.76.0 [iceberg]
jbonofre commented on PR #9398: URL: https://github.com/apache/iceberg/pull/9398#issuecomment-1874301447 It may be worth having a corresponding issue to populate the changelog/release notes. I will do it.
Re: [PR] Replace black by Ruff Formatter [iceberg-python]
rdblue commented on PR #127: URL: https://github.com/apache/iceberg-python/pull/127#issuecomment-1874330644 Looks good to me now. Thanks, @hussein-awala!
Re: [PR] Replace black by Ruff Formatter [iceberg-python]
hussein-awala commented on PR #127: URL: https://github.com/apache/iceberg-python/pull/127#issuecomment-1874338838 I just merged main and fixed the conflicts; it should be ready to merge.
Re: [PR] Data: Allow classes of different packages to implement DeleteFilter [iceberg]
RussellSpitzer commented on code in PR #9352: URL: https://github.com/apache/iceberg/pull/9352#discussion_r1439676088 ## core/src/main/java/org/apache/iceberg/util/DecimalUtil.java: ## @@ -31,7 +31,7 @@ private DecimalUtil() {} public static byte[] toReusedFixLengthBytes( int precision, int scale, BigDecimal decimal, byte[] reuseBuf) { Preconditions.checkArgument( -decimal.scale() == scale, +decimal.scale() <= scale, Review Comment: Is this safe? It is reusing fixed-length bytes, but in this case we aren't using all the bytes.
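To make the concern concrete, here is a small, self-contained illustration (plain Java, not the Iceberg utility) of why the unscaled bytes of a `BigDecimal` depend on its scale: if a value with a smaller scale were encoded without rescaling, a reader that assumes the declared scale would decode the wrong number.
```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Standalone demo: the fixed-length encoding stores the unscaled value,
// so a decimal with a smaller scale must be rescaled before encoding.
public class DecimalScaleDemo {
  public static void main(String[] args) {
    BigDecimal declared = new BigDecimal("1.50"); // scale 2, unscaled value 150
    BigDecimal smaller = new BigDecimal("1.5");   // scale 1, unscaled value 15

    System.out.println(declared.unscaledValue()); // 150
    System.out.println(smaller.unscaledValue());  // 15 -> would decode as 0.15 at scale 2

    // Rescaling first preserves the numeric value under the declared scale.
    BigDecimal rescaled = smaller.setScale(2, RoundingMode.UNNECESSARY);
    System.out.println(rescaled.unscaledValue()); // 150
  }
}
```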
Re: [PR] Data: Allow classes of different packages to implement DeleteFilter [iceberg]
RussellSpitzer commented on code in PR #9352: URL: https://github.com/apache/iceberg/pull/9352#discussion_r1439676343 ## data/src/main/java/org/apache/iceberg/data/DeleteFilter.java: ## @@ -138,7 +138,7 @@ public void incrementDeleteCount() { counter.increment(); } - Accessor posAccessor() { + public Accessor posAccessor() { Review Comment: Why do we want to open this up?
Re: [PR] Data: Allow classes of different packages to implement DeleteFilter [iceberg]
RussellSpitzer commented on PR #9352: URL: https://github.com/apache/iceberg/pull/9352#issuecomment-1874346597 This seems like two different issues being solved in one PR. Please separate them. I think you also need to offer a much more detailed explanation of why you want to make an API public.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439678214 ## core/src/main/java/org/apache/iceberg/deletes/TargetedPositionDeleteWriter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.List; +import java.util.function.Supplier; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.io.DeleteWriteResult; +import org.apache.iceberg.io.FileWriter; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.util.CharSequenceSet; + +/** + * A position delete writer that produces a separate delete file for each referenced data file. + * + * This writer does not keep track of seen deletes and assumes all incoming records are ordered + * by file and position as required by the spec. If there is no external process to order the + * records, consider using {@link SortingPositionOnlyDeleteWriter} instead. 
+ */ +public class TargetedPositionDeleteWriter +implements FileWriter, DeleteWriteResult> { + + private final Supplier, DeleteWriteResult>> writers; + private final List deleteFiles; + private final CharSequenceSet referencedDataFiles; + + private FileWriter, DeleteWriteResult> currentWriter = null; + private CharSequence currentPath = null; + private boolean closed = false; + + public TargetedPositionDeleteWriter( + Supplier, DeleteWriteResult>> writers) { +this.writers = writers; +this.deleteFiles = Lists.newArrayList(); +this.referencedDataFiles = CharSequenceSet.empty(); + } + + @Override + public void write(PositionDelete positionDelete) { +writer(positionDelete.path()).write(positionDelete); + } + + private FileWriter, DeleteWriteResult> writer(CharSequence path) { +if (currentWriter == null) { + openCurrentWriter(path); +} else if (unequal(currentPath, path)) { + closeCurrentWriter(); + openCurrentWriter(path); +} + +return currentWriter; + } + + @Override + public long length() { +throw new UnsupportedOperationException(getClass().getName() + " does not implement length"); + } + + @Override + public DeleteWriteResult result() { +Preconditions.checkState(closed, "Cannot get result from unclosed writer"); +return new DeleteWriteResult(deleteFiles, referencedDataFiles); + } + + @Override + public void close() throws IOException { +if (!closed) { + closeCurrentWriter(); + this.closed = true; +} + } + + private void openCurrentWriter(CharSequence path) { +Preconditions.checkState(!closed, "Writer has already been closed"); +this.currentWriter = writers.get(); +this.currentPath = path; + } + + private void closeCurrentWriter() { +if (currentWriter != null) { + try { +currentWriter.close(); +DeleteWriteResult result = currentWriter.result(); +deleteFiles.addAll(result.deleteFiles()); +referencedDataFiles.addAll(result.referencedDataFiles()); +this.currentWriter = null; +this.currentPath = null; + } catch (IOException e) { +throw new UncheckedIOException("Failed to close current writer", e); + } +} + } + + private static boolean unequal(CharSequence s1, CharSequence s2) { Review Comment: Our `CharSeqComparator` is private. I've put this into a new utility class.
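As a side note for readers, the reason a dedicated comparison is needed at all is that `CharSequence` implementations do not define content-based `equals`. A minimal standalone illustration follows (this is not the new Iceberg utility class mentioned above, just a sketch of the underlying issue):
```java
// Standalone demo: equals() across CharSequence implementations is not content-based,
// so file paths must be compared character by character.
public class CharSequenceCompareDemo {
  static boolean contentEquals(CharSequence s1, CharSequence s2) {
    if (s1 == s2) {
      return true;
    }
    if (s1 == null || s2 == null || s1.length() != s2.length()) {
      return false;
    }
    for (int i = 0; i < s1.length(); i++) {
      if (s1.charAt(i) != s2.charAt(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    CharSequence a = "part-00000.parquet";
    CharSequence b = new StringBuilder("part-00000.parquet");
    System.out.println(a.equals(b));         // false: different implementations
    System.out.println(contentEquals(a, b)); // true: same characters
  }
}
```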
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439678391 ## core/src/main/java/org/apache/iceberg/TableProperties.java: ## @@ -334,6 +335,9 @@ private TableProperties() {} public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms"; public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE; + public static final String DELETE_GRANULARITY = "write.delete.granularity"; Review Comment: Added.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on PR #9384: URL: https://github.com/apache/iceberg/pull/9384#issuecomment-1874349288 > One question: > Iceberg has the rewritePositionDeletesAction. Will this pr influence this action? @jerqi, yes, it will. There is a new test in `TestRewritePositionDeleteFilesAction`
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439679439 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data Review Comment: Makes sense.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439679682 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data + * files into one delete file. This strategy tends to reduce the total number of delete files in the + * table. However, it may lead to the assignment of irrelevant deletes to some data files during the Review Comment: I like it, incorporated.
Re: [PR] Bump Nessie to 0.76.0 [iceberg]
snazy commented on PR #9398: URL: https://github.com/apache/iceberg/pull/9398#issuecomment-1874350376 The build issue (fixed by the 2nd commit) is caused by a Java 21 class file, properly placed in `META-INF/versions/21/com/fasterxml/jackson/...`. There is not yet a shadow-plugin version that can deal with Java 21, except a shadow plugin in a different fork. The "original" [shadow plugin](https://github.com/johnrengelman/shadow/) has been unmaintained since March/April 2023 - there's [a fork](https://github.com/Goooler/shadow) that has changes to support Java 21 and Gradle 8.5+.
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439679853 ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. + * + * Under partition granularity, delete writers are allowed to group deletes for different data + * files into one delete file. This strategy tends to reduce the total number of delete files in the + * table. However, it may lead to the assignment of irrelevant deletes to some data files during the + * job planning. All irrelevant deletes are filtered by readers but add extra overhead, which can be + * mitigated via caching. + * + * Under file granularity, delete writers always organize deletes by their target data file, + * creating separate delete files for each referenced data file. This strategy ensures the job + * planning does not assign irrelevant deletes to data files. However, it also increases the total + * number of delete files in the table and may require a more aggressive approach for delete file + * compaction. + * + * Currently, this configuration is only applicable to position deletes. + * + * Each granularity has its own benefits and drawbacks and should be picked based on a use case. + * Despite the chosen granularity, regular delete compaction remains necessary. It is also possible Review Comment: Switched. ## core/src/main/java/org/apache/iceberg/deletes/DeleteGranularity.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.deletes; + +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; + +/** + * An enum that represents the granularity of deletes. 
+ * + * Under partition granularity, delete writers are allowed to group deletes for different data + * files into one delete file. This strategy tends to reduce the total number of delete files in the + * table. However, it may lead to the assignment of irrelevant deletes to some data files during the + * job planning. All irrelevant deletes are filtered by readers but add extra overhead, which can be + * mitigated via caching. + * + * Under file granularity, delete writers always organize deletes by their target data file, + * creating separate delete files for each referenced data file. This strategy ensures the job + * planning does not assign irrelevant deletes to data files. However, it also increases the total Review Comment: Rewrote this part as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
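For readers following this thread, a minimal sketch of how the granularity discussed in this PR would be selected, assuming an existing `Table` handle and the `write.delete.granularity` property name that appears later in this review; the `file`/`partition` values mirror the two modes described in the Javadoc above and are not quoted from the merged code:

```java
// Hypothetical usage sketch: opt a table into file-scoped position delete files.
// "partition" is the other mode; the default is discussed elsewhere in this thread.
import org.apache.iceberg.Table;

public class DeleteGranularityExample {
  public static void useFileGranularity(Table table) {
    table.updateProperties()
        .set("write.delete.granularity", "file")
        .commit();
  }
}
```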
Re: [PR] API: Fix day partition transform result type [iceberg]
RussellSpitzer commented on code in PR #9345: URL: https://github.com/apache/iceberg/pull/9345#discussion_r1439682666 ## format/spec.md: ## @@ -318,7 +318,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string` | Source type | | **`year`**| Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | -| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | +| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `date` | Review Comment: Agree with Fokko, this is a description of the transform which takes a date or timestamp and produces an int. If you are implementing this transform and return a date the transform would be incorrect and incompatible with other Iceberg implementations. Now if you implement this function and display the output as a different type (like the metadata table) that's fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
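To make the spec point concrete, a small illustration (not part of the PR) of what the `day` transform is expected to produce: an int counting days from 1970-01-01, which an engine may separately choose to render as a date in places like metadata tables:

```java
// Illustrative only: the day transform's result type is int (days from epoch),
// even if an engine displays that int as a date.
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DayTransformExample {
  static int dayTransform(LocalDate value) {
    return (int) ChronoUnit.DAYS.between(LocalDate.ofEpochDay(0), value);
  }

  public static void main(String[] args) {
    // prints 19724: the int partition value for 2024-01-02
    System.out.println(dayTransform(LocalDate.of(2024, 1, 2)));
  }
}
```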
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439685145 ## core/src/main/java/org/apache/iceberg/TableProperties.java: ## @@ -334,6 +335,9 @@ private TableProperties() {} public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms"; public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE; + public static final String DELETE_GRANULARITY = "write.delete.granularity"; Review Comment: To be honest, I doubt we will ever support this property for equality deletes. In general, I do get that we may want to configure position and equality deletes differently. We can explore adding an extra namespace. I am still not sure this use case falls into that bucket. @rdblue @RussellSpitzer @zhongyujiang, thoughts? Do we want a prefix for this config to make it explicit that it only applies to position deletes? Currently, I only note that in the docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add ManifestWrite benchmark [iceberg]
dramaticlly commented on code in PR #8637: URL: https://github.com/apache/iceberg/pull/8637#discussion_r1439696239 ## core/src/jmh/java/org/apache/iceberg/ManifestWriteBenchmark.java: ## @@ -0,0 +1,166 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg; + +import java.io.File; +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.ByteBuffer; +import java.util.Map; +import java.util.Random; +import java.util.UUID; +import java.util.concurrent.TimeUnit; +import org.apache.commons.io.FileUtils; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.relocated.com.google.common.io.Files; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Fork; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.Param; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.TearDown; +import org.openjdk.jmh.annotations.Threads; +import org.openjdk.jmh.annotations.Timeout; + +/** + * A benchmark that evaluates the performance of writing manifest files + * + * To run this benchmark: + * ./gradlew :iceberg-core:jmh -PjmhIncludeRegex=ManifestWriteBenchmark + * + */ +@Fork(1) +@State(Scope.Benchmark) +@Measurement(iterations = 5) +@BenchmarkMode(Mode.SingleShotTime) +@Timeout(time = 5, timeUnit = TimeUnit.MINUTES) +public class ManifestWriteBenchmark { + + private static final int NUM_FILES = 10; + private static final int NUM_ROWS = 10; + private static final int NUM_COLS = 100; + + private String baseDir; + private String manifestListFile; + + private Metrics metrics; + + @Setup + public void before() { +Random random = new Random(System.currentTimeMillis()); +// Pre-create the metrics to avoid doing this in the benchmark itself +metrics = randomMetrics(random); + } + + @TearDown + public void after() { +if (baseDir != null) { + FileUtils.deleteQuietly(new File(baseDir)); + baseDir = null; +} + +manifestListFile = null; + } + + @State(Scope.Benchmark) + public static class BenchmarkState { +@Param({"1", "2"}) +public int formatVersion; Review Comment: ```suggestion int formatVersion; ``` checkstyle does not like it and I believe it does not need to be public if we always access it via state -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
RussellSpitzer commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1439698943 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkExecutorCache.java: ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark; + +import com.github.benmanes.caffeine.cache.Cache; +import com.github.benmanes.caffeine.cache.Caffeine; +import java.time.Duration; +import java.util.Collection; +import java.util.Collections; +import java.util.Map; +import java.util.function.Supplier; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.relocated.com.google.common.collect.Queues; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * An executor cache for optimizing tasks by reducing the computation and IO overhead. + * + * The cache is configurable and enabled through Spark SQL properties. Its key features include + * setting limits on the total cache size and maximum size for individual entries. Additionally, it + * implements automatic eviction of entries after a specified duration of inactivity. The cache will + * respect the SQL configuration valid at the time of initialization. All subsequent changes will + * have no effect. + * + * Usage pattern involves fetching data from the cache using a unique combination of execution ID + * and key. If the data is not present in the cache, it is computed using the provided supplier and + * stored in the cache, subject to the defined size constraints. + * + * Note that this class employs the singleton pattern to ensure only one cache exists per JVM. 
+ * + * @see SparkUtil#executionId() + */ +public class SparkExecutorCache { + + private static final Logger LOG = LoggerFactory.getLogger(SparkExecutorCache.class); + + private static final SparkConfParser CONF_PARSER = new SparkConfParser(); + private static final boolean CACHE_ENABLED = parseCacheEnabledConf(); + private static final Duration TIMEOUT = parseTimeoutConf(); + private static final long MAX_ENTRY_SIZE = parseMaxEntrySizeConf(); + private static final long MAX_TOTAL_SIZE = parseMaxTotalSizeConf(); + private static final String EXECUTOR_DESC = SparkUtil.executorDesc(); + + private static volatile SparkExecutorCache instance = null; + + private final Map> keysByExecutionId; + private volatile Cache cache = null; + + private SparkExecutorCache() { +this.keysByExecutionId = Collections.synchronizedMap(Maps.newHashMap()); + } + + public static SparkExecutorCache getOrCreate() { +if (instance == null && CACHE_ENABLED) { + synchronized (SparkExecutorCache.class) { +if (instance == null) { + SparkExecutorCache.instance = new SparkExecutorCache(); +} + } +} + +return instance; + } + + public static void cleanUp(String executionId) { +if (instance != null) { Review Comment: Should we throw an error if instance == null and cleanUp gets called? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
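The Javadoc above describes a weighted get-or-compute cache keyed by execution ID plus entry key. A self-contained sketch of that pattern using Caffeine (which the class already imports); the limits and key layout here are illustrative, not the SQL-derived values the PR reads at initialization:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.function.Supplier;

// Minimal get-or-compute sketch with a total weight limit and inactivity-based eviction.
class ExecutorCacheSketch {
  private final Cache<String, Object> cache =
      Caffeine.newBuilder()
          .maximumWeight(128L * 1024 * 1024)                    // illustrative total size limit
          .weigher((String key, Object value) -> estimateSize(value))
          .expireAfterAccess(Duration.ofMinutes(10))            // evict entries after inactivity
          .build();

  @SuppressWarnings("unchecked")
  <V> V getOrLoad(String executionId, String key, Supplier<V> valueSupplier) {
    // entries are namespaced by execution ID so a finished query can be invalidated as a group
    return (V) cache.get(executionId + "/" + key, ignored -> valueSupplier.get());
  }

  private int estimateSize(Object value) {
    return 1; // placeholder; the real cache weighs entries by their estimated size in bytes
  }
}
```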
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
RussellSpitzer commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1439702880 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkExecutorCache.java: ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark; + +import com.github.benmanes.caffeine.cache.Cache; +import com.github.benmanes.caffeine.cache.Caffeine; +import java.time.Duration; +import java.util.Collection; +import java.util.Collections; +import java.util.Map; +import java.util.function.Supplier; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.relocated.com.google.common.collect.Queues; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * An executor cache for optimizing tasks by reducing the computation and IO overhead. + * + * The cache is configurable and enabled through Spark SQL properties. Its key features include + * setting limits on the total cache size and maximum size for individual entries. Additionally, it + * implements automatic eviction of entries after a specified duration of inactivity. The cache will + * respect the SQL configuration valid at the time of initialization. All subsequent changes will + * have no effect. + * + * Usage pattern involves fetching data from the cache using a unique combination of execution ID + * and key. If the data is not present in the cache, it is computed using the provided supplier and + * stored in the cache, subject to the defined size constraints. + * + * Note that this class employs the singleton pattern to ensure only one cache exists per JVM. 
+ * + * @see SparkUtil#executionId() + */ +public class SparkExecutorCache { + + private static final Logger LOG = LoggerFactory.getLogger(SparkExecutorCache.class); + + private static final SparkConfParser CONF_PARSER = new SparkConfParser(); + private static final boolean CACHE_ENABLED = parseCacheEnabledConf(); + private static final Duration TIMEOUT = parseTimeoutConf(); + private static final long MAX_ENTRY_SIZE = parseMaxEntrySizeConf(); + private static final long MAX_TOTAL_SIZE = parseMaxTotalSizeConf(); + private static final String EXECUTOR_DESC = SparkUtil.executorDesc(); + + private static volatile SparkExecutorCache instance = null; + + private final Map> keysByExecutionId; + private volatile Cache cache = null; + + private SparkExecutorCache() { +this.keysByExecutionId = Collections.synchronizedMap(Maps.newHashMap()); + } + + public static SparkExecutorCache getOrCreate() { +if (instance == null && CACHE_ENABLED) { + synchronized (SparkExecutorCache.class) { +if (instance == null) { + SparkExecutorCache.instance = new SparkExecutorCache(); +} + } +} + +return instance; + } + + public static void cleanUp(String executionId) { +if (instance != null) { + instance.invalidate(executionId); +} + } + + public long maxEntrySize() { +return MAX_ENTRY_SIZE; + } + + public V get(String executionId, String key, Supplier valueSupplier, long valueSize) { +if (valueSize > MAX_ENTRY_SIZE) { Review Comment: I'm a little confused here, if I try to get the cache here for this executionId and key, and it already is set in the cache I may still error out. For example get(id_1, key_1) - with 64mb Cache Entry // Inits cache get(id_1, key_1) - with same 64mb Cache entry // Fails although value is already set -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
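The concern above is that the size gate runs before the cache lookup, so a value that was already admitted can still be rejected on a later call. A simplified, self-contained sketch of one possible lookup-first ordering (a plain map stands in for the Caffeine-backed cache; names and the 64MB limit are illustrative, not the PR's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: consult the cache before applying the per-entry size limit, so a repeated
// get(id_1, key_1) call with a 64mb value hits the cache instead of failing.
class SizeGatedCacheSketch {
  private static final long MAX_ENTRY_SIZE = 64L * 1024 * 1024;   // illustrative limit

  private final Map<String, Object> cache = new ConcurrentHashMap<>();

  @SuppressWarnings("unchecked")
  <V> V get(String executionId, String key, Supplier<V> valueSupplier, long valueSize) {
    String cacheKey = executionId + "/" + key;

    Object cached = cache.get(cacheKey);          // look up before the size gate
    if (cached != null) {
      return (V) cached;
    }

    if (valueSize > MAX_ENTRY_SIZE) {
      return valueSupplier.get();                 // too large to cache: compute without caching
    }

    return (V) cache.computeIfAbsent(cacheKey, ignored -> valueSupplier.get());
  }
}
```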
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
rdblue commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439722314 ## core/src/main/java/org/apache/iceberg/TableProperties.java: ## @@ -334,6 +335,9 @@ private TableProperties() {} public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms"; public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE; + public static final String DELETE_GRANULARITY = "write.delete.granularity"; Review Comment: This option makes no sense for equality deletes because they aren't targeted at a single file, so I agree that we won't support it for equality. This is also mostly advisory. It is unlikely that we will support it in Flink and will instead always use file-level granularity. Maybe we won't even want to support this in the long term, if we decide that Spark performs better with file granularity at all times. I guess where I'm at for this is that I would probably not worry much about it -- but also not add it to documentation since we will probably not want people setting it themselves. I think I'd leave it as `write.delete.granularity`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump pytest from 7.4.3 to 7.4.4 [iceberg-python]
Fokko merged PR #248: URL: https://github.com/apache/iceberg-python/pull/248 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]
jackye1995 commented on code in PR #9385: URL: https://github.com/apache/iceberg/pull/9385#discussion_r1437796852 ## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ## @@ -749,4 +795,23 @@ public void applyEndpointConfigurations(T builder) { builder.endpointOverride(URI.create(endpoint)); } } + + /** + * Add the S3 Access Grants Plugin for an S3 client. + * + * Sample usage: + * + * + * S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations) + * + */ + public void applyS3AccessGrantsConfigurations(T builder) { +if (isS3AccessGrantsEnabled) { + S3AccessGrantsPlugin s3AccessGrantsPlugin = Review Comment: I think there is an issue we need to handle here. This will force the users to have a dependency of the S3AccessGrantsPlugin library, even if they do not need this feature. To circumvent this, you can take a look at how we do that for things like HTTP client library dependencies: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/ApacheHttpClientConfigurations.java. By moving all dependencies of a feature to a different class, it can avoid a runtime dependency. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
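The class-isolation technique referenced above relies on the JVM loading a class only when it is first used, so an optional plugin jar is needed only on the code path that enables it. A generic, self-contained illustration of that shape (all names here are made up; the real change would mirror `ApacheHttpClientConfigurations` and reference the S3 Access Grants plugin instead of this placeholder):

```java
// FeatureProperties never imports the optional library; only the helper class below would.
// Because class resolution is typically lazy, users who leave the feature disabled never
// trigger loading of the optional dependency at runtime.
class FeatureProperties {
  private final boolean pluginEnabled;

  FeatureProperties(boolean pluginEnabled) {
    this.pluginEnabled = pluginEnabled;
  }

  <T> void applyPluginConfigurations(T builder) {
    if (pluginEnabled) {
      OptionalPluginConfigurations.apply(builder);   // helper referenced only on this branch
    }
  }
}

class OptionalPluginConfigurations {
  static <T> void apply(T builder) {
    // imports and construction of the optional plugin would live here
  }
}
```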
Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]
jackye1995 commented on code in PR #9385: URL: https://github.com/apache/iceberg/pull/9385#discussion_r1439793304 ## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ## @@ -749,4 +795,23 @@ public void applyEndpointConfigurations(T builder) { builder.endpointOverride(URI.create(endpoint)); } } + + /** + * Add the S3 Access Grants Plugin for an S3 client. + * + * Sample usage: + * + * + * S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations) + * + */ + public void applyS3AccessGrantsConfigurations(T builder) { +if (isS3AccessGrantsEnabled) { + S3AccessGrantsPlugin s3AccessGrantsPlugin = Review Comment: > but while compiling the iceberg-aws module, you'll still need both ApacheHttpClient and UrlConnectionHttpClient classes. Sorry my typo, we are talking about runtime dependency on a package, not compile time. Yes compile time dependency is required. 2MB is still significant in my mind since we are really trying to minimize bundle size as much as possible. But the more important issue is that, people who use the current AWS bundle would get a failure unless they update their stack to include the plugin they do not use, which is going to be a blocker for upgrades. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439797790 ## core/src/main/java/org/apache/iceberg/TableProperties.java: ## @@ -334,6 +335,9 @@ private TableProperties() {} public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms"; public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE; + public static final String DELETE_GRANULARITY = "write.delete.granularity"; Review Comment: The idea not to document it for now is reasonable given that it acts like a recommendation and we are not sure we want to support it in the future. Let's keep the name as is then. Adding a way to configure position and equality deletes separately is another discussion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on code in PR #9384: URL: https://github.com/apache/iceberg/pull/9384#discussion_r1439799553 ## core/src/main/java/org/apache/iceberg/TableProperties.java: ## @@ -334,6 +335,9 @@ private TableProperties() {} public static final String MAX_REF_AGE_MS = "history.expire.max-ref-age-ms"; public static final long MAX_REF_AGE_MS_DEFAULT = Long.MAX_VALUE; + public static final String DELETE_GRANULARITY = "write.delete.granularity"; Review Comment: Reverted based on [this](https://github.com/apache/iceberg/pull/9384/files#r1436606249) discussion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.3, 3.4: Backport #9255 - Fix clobbering of files across epochs [iceberg]
rdblue merged PR #9399: URL: https://github.com/apache/iceberg/pull/9399 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi merged PR #9384: URL: https://github.com/apache/iceberg/pull/9384 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Data, Spark 3.5: Support file and partition delete granularity [iceberg]
aokolnychyi commented on PR #9384: URL: https://github.com/apache/iceberg/pull/9384#issuecomment-187452 Thanks for reviewing, @jerqi @zinking @zhongyujiang @rdblue @RussellSpitzer! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Files are being overwritten on subsequent runs of Spark Structured Streaming [iceberg]
amogh-jahagirdar closed issue #8609: Files are being overwritten on subsequent runs of Spark Structured Streaming URL: https://github.com/apache/iceberg/issues/8609 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Files are being overwritten on subsequent runs of Spark Structured Streaming [iceberg]
amogh-jahagirdar commented on issue #8609: URL: https://github.com/apache/iceberg/issues/8609#issuecomment-1874704535 This issue should be resolved in https://github.com/apache/iceberg/pull/9255 and https://github.com/apache/iceberg/pull/9399 (backports to Spark 3.3 and Spark 3.4). This change should be released in the 1.5 release. I'm going to close this issue for now, please feel free to reopen in case you still have any doubts or questions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Rewrite manifest action can only split large manifest file into two manifests,instead of expected target size [iceberg]
github-actions[bot] closed issue #6891: Rewrite manifest action can only split large manifest file into two manifests,instead of expected target size URL: https://github.com/apache/iceberg/issues/6891 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Rewrite manifest action can only split large manifest file into two manifests,instead of expected target size [iceberg]
github-actions[bot] commented on issue #6891: URL: https://github.com/apache/iceberg/issues/6891#issuecomment-1874709553 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Renaming a table may conflict with the new table with old table name [iceberg]
github-actions[bot] commented on issue #6890: URL: https://github.com/apache/iceberg/issues/6890#issuecomment-1874709577 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Renaming a table may conflict with the new table with old table name [iceberg]
github-actions[bot] closed issue #6890: Renaming a table may conflict with the new table with old table name URL: https://github.com/apache/iceberg/issues/6890 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on code in PR #9359: URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439975347 ## spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkAppenderFactory.java: ## @@ -161,7 +162,12 @@ private StructType lazyPosDeleteSparkType() { } @Override - public FileAppender newAppender(OutputFile file, FileFormat fileFormat) { + public FileAppender newAppender(OutputFile outputFile, FileFormat format) { +return newAppender(EncryptionUtil.plainAsEncryptedOutput(outputFile), format); + } + + @Override + public FileAppender newAppender(EncryptedOutputFile file, FileFormat fileFormat) { Review Comment: Ah, I was expecting these changes in Spark 3.5. Feel free to move them there, or we can add them to Spark 3.4 and open follow up PRs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on code in PR #9359: URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439977630 ## core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java: ## @@ -67,7 +72,15 @@ public InputFile decrypt(EncryptedInputFile encrypted) { @Override public Iterable decrypt(Iterable encrypted) { // Bulk decrypt is only applied to data files. Returning source input files for parquet. -return Iterables.transform(encrypted, this::decrypt); +if (nativeDataEncryption) { + return Iterables.transform(encrypted, this::getSourceFile); +} else { + return Iterables.transform(encrypted, this::decrypt); +} + } + + private InputFile getSourceFile(EncryptedInputFile encryptedFile) { Review Comment: Style: Iceberg method names should not include `get`. Instead, use a more helpful verb like `create` or `find`, or simply leave it out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on code in PR #9359: URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439979355 ## core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java: ## @@ -41,7 +42,10 @@ public class StandardEncryptionManager implements EncryptionManager { * @param kmsClient Client of KMS used to wrap/unwrap keys in envelope encryption */ public StandardEncryptionManager( - String tableKeyId, int dataKeyLength, KeyManagementClient kmsClient) { + String tableKeyId, + int dataKeyLength, + KeyManagementClient kmsClient, + boolean nativeDataEncryption) { Review Comment: I don't think that this should be passed in. The encryption manager needs to support files that use both native encryption (Parquet) and files that use AES GCM streams (Avro). There is no way to set this correctly because the behavior depends on the file type. Instead, this should _always_ call `decrypt` and the caller should know how to handle the resulting `InputFile`. If it is simply an `InputFile` then it should be used directly. If it is a `StandardDecryptedInputFile` then it should be used for Avro but trigger native decryption for Parquet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on code in PR #9359: URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439979355 ## core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java: ## @@ -41,7 +42,10 @@ public class StandardEncryptionManager implements EncryptionManager { * @param kmsClient Client of KMS used to wrap/unwrap keys in envelope encryption */ public StandardEncryptionManager( - String tableKeyId, int dataKeyLength, KeyManagementClient kmsClient) { + String tableKeyId, + int dataKeyLength, + KeyManagementClient kmsClient, + boolean nativeDataEncryption) { Review Comment: I don't think that this should be passed in. The encryption manager needs to support files that use both native encryption (Parquet) and files that use AES GCM streams (Avro). There is no way to set this correctly because the behavior depends on the file type. Instead, the caller should decide whether to call `decrypt` based on whether it is needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]
jackye1995 commented on code in PR #9385: URL: https://github.com/apache/iceberg/pull/9385#discussion_r1439983680 ## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ## @@ -749,4 +795,23 @@ public void applyEndpointConfigurations(T builder) { builder.endpointOverride(URI.create(endpoint)); } } + + /** + * Add the S3 Access Grants Plugin for an S3 client. + * + * Sample usage: + * + * + * S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations) + * + */ + public void applyS3AccessGrantsConfigurations(T builder) { +if (isS3AccessGrantsEnabled) { + S3AccessGrantsPlugin s3AccessGrantsPlugin = Review Comment: All the AWS dependencies are not packaged together with the iceberg-aws module: https://github.com/apache/iceberg/blob/main/build.gradle#L467-L476. This is because the users typically manage their own AWS SDK version that is very likely different from the one specified by this project. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on code in PR #9359: URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439979355 ## core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java: ## @@ -41,7 +42,10 @@ public class StandardEncryptionManager implements EncryptionManager { * @param kmsClient Client of KMS used to wrap/unwrap keys in envelope encryption */ public StandardEncryptionManager( - String tableKeyId, int dataKeyLength, KeyManagementClient kmsClient) { + String tableKeyId, + int dataKeyLength, + KeyManagementClient kmsClient, + boolean nativeDataEncryption) { Review Comment: I don't think that this should be passed in. The encryption manager needs to support files that use both native encryption (Parquet) and files that use AES GCM streams (Avro). There is no way to set this correctly because the behavior depends on the file type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on code in PR #9359: URL: https://github.com/apache/iceberg/pull/9359#discussion_r1439988090 ## core/src/main/java/org/apache/iceberg/encryption/StandardEncryptionManager.java: ## @@ -67,7 +72,15 @@ public InputFile decrypt(EncryptedInputFile encrypted) { @Override public Iterable decrypt(Iterable encrypted) { // Bulk decrypt is only applied to data files. Returning source input files for parquet. -return Iterables.transform(encrypted, this::decrypt); +if (nativeDataEncryption) { Review Comment: There is no way to know the intended file format or whether it will use native encryption at this point. I think this needs to always return the result of `decrypt`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Close the MetricsReporter when the Catalog is closed. [iceberg]
dramaticlly commented on code in PR #9353: URL: https://github.com/apache/iceberg/pull/9353#discussion_r1439989287 ## aws/src/main/java/org/apache/iceberg/aws/dynamodb/DynamoDbCatalog.java: ## @@ -487,6 +486,7 @@ public Configuration getConf() { @Override public void close() throws IOException { +super.close(); Review Comment: I think that in order to close the metrics reporter when closing the catalog, it can also be implemented by adding `super.metricsReporter()` to closeableGroup on line 143, but this requires changing the `BaseMetastoreCatalog.metricsReporter()` method from private to protected. This approach keeps the `Closeable` in the concrete class instead of the abstract base class, and it seems more aligned with how other closeables are handled in this class. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
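A hedged sketch of the wiring described in this comment, assuming `metricsReporter()` is relaxed from private to protected in `BaseMetastoreCatalog` as noted; this is a shape suggestion, not the merged change:

```java
import org.apache.iceberg.io.CloseableGroup;
import org.apache.iceberg.metrics.MetricsReporter;

// During catalog initialization, hand the reporter to the existing CloseableGroup so it is
// closed together with the clients, instead of overriding close() in each concrete catalog.
class CloseableReporterWiring {
  static void register(CloseableGroup closeableGroup, MetricsReporter reporter) {
    if (reporter instanceof AutoCloseable) {
      closeableGroup.addCloseable((AutoCloseable) reporter);
    }
  }
}
```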
Re: [PR] Deliver key metadata for encryption of data files [iceberg]
rdblue commented on PR #9359: URL: https://github.com/apache/iceberg/pull/9359#issuecomment-187478 @ggershinsky, I'm still working through this review, so don't feel like you need to address or respond to my comments yet! Also, here are a few notes for myself when I pick this up tomorrow: * How do we ensure that all files that are committed to a table are encrypted? I think we should have a validation in `FastAppend` and `MergingSnapshotProducer` that verifies the encryption key metadata is non-null if the standard encryption manager is used. * There are places like `BaseTaskWriter.openCurrent` that also call `EncryptedOutputFile.encryptingOutputFile()` to get the location. We should look into whether those should use the underlying file path or need to be updated. * I think the way to solve the problem with `StandardEncryptionManager.decrypt` is to make `StandardDecryptedInputFile` implement both `InputFile` (that runs AES GCM decryption) and `EncryptedInputFile` (to provide access to key metadata and the encrypted underlying `InputFile`). Then the read path continues to pass around `InputFile`, but can check whether the file can be used via `EncryptedInputFile` for native decryption. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
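For the last bullet, a hedged sketch of a file wrapper that implements both interfaces, delegating reads to the AES GCM decrypting view while keeping the encrypted source and key metadata reachable so Parquet can opt into native decryption (class shape only, not the PR's actual code):

```java
import org.apache.iceberg.encryption.EncryptedInputFile;
import org.apache.iceberg.encryption.EncryptionKeyMetadata;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.SeekableInputStream;

// Sketch: reads go through the decrypting view; key metadata and the raw encrypted file
// remain available so a Parquet reader can choose native decryption instead.
class DecryptedInputFileSketch implements InputFile, EncryptedInputFile {
  private final InputFile decryptingView;      // e.g. an AES GCM stream wrapper
  private final EncryptedInputFile source;

  DecryptedInputFileSketch(InputFile decryptingView, EncryptedInputFile source) {
    this.decryptingView = decryptingView;
    this.source = source;
  }

  @Override
  public long getLength() {
    return decryptingView.getLength();
  }

  @Override
  public SeekableInputStream newStream() {
    return decryptingView.newStream();
  }

  @Override
  public String location() {
    return decryptingView.location();
  }

  @Override
  public boolean exists() {
    return decryptingView.exists();
  }

  @Override
  public InputFile encryptedInputFile() {
    return source.encryptedInputFile();
  }

  @Override
  public EncryptionKeyMetadata keyMetadata() {
    return source.keyMetadata();
  }
}
```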
Re: [PR] Core: Close the MetricsReporter when the Catalog is closed. [iceberg]
huyuanfeng2018 commented on code in PR #9353: URL: https://github.com/apache/iceberg/pull/9353#discussion_r1440003882 ## aws/src/main/java/org/apache/iceberg/aws/dynamodb/DynamoDbCatalog.java: ## @@ -487,6 +486,7 @@ public Configuration getConf() { @Override public void close() throws IOException { +super.close(); Review Comment: > I think that in order to close the metrics reporter when closing the catalog, it can also be implemented by adding `super.metricsReporter()` to closeableGroup on line 143, but this requires changing the `BaseMetastoreCatalog.metricsReporter()` method from private to protected. > > This approach keeps the `Closeable` in the concrete class instead of the abstract base class, and it seems more aligned with how other closeables are handled in this class. I agree with this statement. In Java, it is generally common practice to call super.close() when a subclass overrides a method of its parent class, which is why I did it this way at first. If we want to unify on the `Closeable` approach, it is indeed better to use `closeableGroup`. I will modify this part of the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
szehon-ho commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1433435467 ## core/src/main/java/org/apache/iceberg/deletes/PositionDeleteIndex.java: ## @@ -44,4 +44,14 @@ public interface PositionDeleteIndex { /** Returns true if this collection contains no element. */ boolean isEmpty(); + + /** Returns true if this collection contains elements. */ + default boolean isNotEmpty() { Review Comment: is this really needed? (versus !empty?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]
adnanhemani commented on code in PR #9385: URL: https://github.com/apache/iceberg/pull/9385#discussion_r1440008488 ## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ## @@ -749,4 +795,23 @@ public void applyEndpointConfigurations(T builder) { builder.endpointOverride(URI.create(endpoint)); } } + + /** + * Add the S3 Access Grants Plugin for an S3 client. + * + * Sample usage: + * + * + * S3Client.builder().applyMutation(s3FileIOProperties::applyS3AccessGrantsConfigurations) + * + */ + public void applyS3AccessGrantsConfigurations(T builder) { +if (isS3AccessGrantsEnabled) { + S3AccessGrantsPlugin s3AccessGrantsPlugin = Review Comment: I don't think we're on the same page about this - the S3 Access Grants plugin code is separate from the rest of the AWS SDK and can therefore be treated like any other third-party dependency. But this is a moot point with the new change I've introduced: the size increase after these changes is just 2.2KB (likely just the newly introduced source code itself). Please let me know if these changes are sufficient or if there are any other changes that need to be made. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Support filtering with buckets in RewriteDataFilesProcedure [iceberg]
manuzhang commented on PR #9396: URL: https://github.com/apache/iceberg/pull/9396#issuecomment-1874770664 @RussellSpitzer could you please give an example of passing a partition (esp. a bucket) via a filter expression? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] how to integrations object storage ceph ? [iceberg]
hchautrung commented on issue #7158: URL: https://github.com/apache/iceberg/issues/7158#issuecomment-1874771632 Hi @jkl0898, I am looking for a solution to use Ceph with Iceberg. We currently use MinIO, but we are looking for an alternative to replace it. Could you share the technical details of how to configure Ceph with Iceberg? Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
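Not an authoritative answer to the question, but since Ceph RGW exposes an S3-compatible API (like MinIO), the usual starting point is pointing `S3FileIO` at the RGW endpoint with path-style access. A hedged sketch with placeholder catalog name, bucket, and endpoint; the catalog type and credential setup depend on the deployment:

```java
import org.apache.spark.sql.SparkSession;

// Placeholder values throughout: the endpoint, warehouse bucket, and catalog choice are
// examples only, not a tested configuration.
public class CephIcebergSession {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder()
            .appName("iceberg-on-ceph")
            .config("spark.sql.catalog.ceph", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.ceph.type", "hadoop")            // placeholder; any catalog works
            .config("spark.sql.catalog.ceph.warehouse", "s3a://warehouse-bucket/iceberg")
            .config("spark.sql.catalog.ceph.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
            .config("spark.sql.catalog.ceph.s3.endpoint", "http://ceph-rgw.example.com:7480")
            .config("spark.sql.catalog.ceph.s3.path-style-access", "true")
            // Hadoop-side S3A settings, needed here because the hadoop catalog also touches the warehouse path
            .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw.example.com:7480")
            .config("spark.hadoop.fs.s3a.path.style.access", "true")
            .getOrCreate();

    spark.sql("SHOW NAMESPACES IN ceph").show();
  }
}
```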
Re: [PR] Core, Spark: Correct the delete record count for PartitionTable [iceberg]
ConeyLiu commented on PR #9389: URL: https://github.com/apache/iceberg/pull/9389#issuecomment-1874773959 Thanks @singhpk234 @RussellSpitzer @dramaticlly -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
szehon-ho commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1434014371

## data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java:

@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Queue;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.ExecutorService;
+import java.util.function.Function;
+import java.util.function.Supplier;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.data.avro.DataReader;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.parquet.GenericParquetReaders;
+import org.apache.iceberg.deletes.Deletes;
+import org.apache.iceberg.deletes.PositionDeleteIndex;
+import org.apache.iceberg.deletes.PositionDeleteIndexUtil;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.DeleteSchemaUtil;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.orc.OrcRowReader;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.parquet.ParquetValueReader;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.math.LongMath;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.util.CharSequenceMap;
+import org.apache.iceberg.util.StructLikeSet;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.apache.orc.TypeDescription;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class BaseDeleteLoader implements DeleteLoader {
+
+  private static final Logger LOG = LoggerFactory.getLogger(BaseDeleteLoader.class);
+  private static final Schema POS_DELETE_SCHEMA = DeleteSchemaUtil.pathPosSchema();
+
+  private final Function<DeleteFile, InputFile> loadInputFile;
+  private final ExecutorService workerPool;
+
+  public BaseDeleteLoader(Function<DeleteFile, InputFile> loadInputFile) {
+    this(loadInputFile, ThreadPools.getDeleteWorkerPool());
+  }
+
+  public BaseDeleteLoader(
+      Function<DeleteFile, InputFile> loadInputFile, ExecutorService workerPool) {
+    this.loadInputFile = loadInputFile;
+    this.workerPool = workerPool;
+  }
+
+  /**
+   * Checks if the given number of bytes can be cached.
+   *
+   * <p>Implementations should override this method if they support caching. It is also recommended
+   * to use the provided size as a guideline to decide whether the value is eligible for caching.
+   * For instance, it may be beneficial to discard values that are too large to optimize the cache
+   * performance and utilization.
+   */
+  protected boolean canCache(long size) {
+    return false;
+  }
+
+  /**
+   * Gets the cached value for the key or populates the cache with a new mapping.
+   *
+   * <p>If the value for the specified key is in the cache, it should be returned. If the value is
+   * not in the cache, implementations should compute the value using the provided supplier, cache
+   * it, and then return it.
+   *
+   * <p>This method will be called only if {@link #canCache(long)} returned true.
+   */
+  protected <V> V get(String key, Supplier<V> valueSupplier, long valueSize) {

Review Comment:
   get() is a bit confusing, can we call it getOrLoad (like on the calling method), or getOrPopulate()?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org ---
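For readers skimming the hunk above: the two protected methods are the extension points for caching. Below is a minimal, hypothetical sketch of a subclass that wires them to a bounded Caffeine cache; the class name, the cache choice, and the size limits are illustrative assumptions, not the PR's actual Spark executor-side cache.

    import java.util.function.Function;
    import java.util.function.Supplier;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import org.apache.iceberg.DeleteFile;
    import org.apache.iceberg.data.BaseDeleteLoader;
    import org.apache.iceberg.io.InputFile;

    // Hypothetical subclass: keeps loaded delete sets in a bounded in-process cache.
    public class CachingDeleteLoader extends BaseDeleteLoader {

      private static final long MAX_CACHEABLE_SIZE = 32L * 1024 * 1024; // 32 MB, illustrative

      private final Cache<String, Object> cache =
          Caffeine.newBuilder().maximumSize(1_000).build();

      public CachingDeleteLoader(Function<DeleteFile, InputFile> loadInputFile) {
        super(loadInputFile);
      }

      @Override
      protected boolean canCache(long size) {
        // use the estimated size as a guideline and keep oversized values out of the cache
        return size <= MAX_CACHEABLE_SIZE;
      }

      @Override
      @SuppressWarnings("unchecked")
      protected <V> V get(String key, Supplier<V> valueSupplier, long valueSize) {
        // return the cached value if present, otherwise compute it once and remember it
        return (V) cache.get(key, ignored -> valueSupplier.get());
      }
    }

The sketch only illustrates the contract described by the javadoc in the hunk; the PR itself backs these hooks with a cache shared across tasks on Spark executors.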
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
szehon-ho commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1434014371

## data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java:

Review Comment:
   get() is a bit confusing, can we call it getOrLoad (like on the calling method), or maybe better computeIfAbsent()?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
szehon-ho commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1434014371

## data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java:

Review Comment:
   get() is a bit confusing, can we call it getOrLoad (like on the calling method), or maybe better getOrDefault() as in Java map API (renaming the other arguments to be default)?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
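On the naming options raised in the two comments above, the semantics of the two java.util.Map methods are what make the difference: computeIfAbsent loads and stores the value, while getOrDefault only returns a fallback without populating the map. A tiny, self-contained illustration (values are arbitrary):

    import java.util.HashMap;
    import java.util.Map;

    public class CacheNamingSemantics {
      public static void main(String[] args) {
        Map<String, Integer> cache = new HashMap<>();

        // computeIfAbsent computes the value once, stores it, and returns it
        int loaded = cache.computeIfAbsent("delete-file-a", key -> 42);

        // getOrDefault returns the fallback but never populates the map
        int fallback = cache.getOrDefault("delete-file-b", -1);

        // prints: loaded=42 fallback=-1 size=1
        System.out.println("loaded=" + loaded + " fallback=" + fallback + " size=" + cache.size());
      }
    }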
[PR] Spark 3.5: Set log level to WARN for rewrite task failure with partial progress [iceberg]
manuzhang opened a new pull request, #9400: URL: https://github.com/apache/iceberg/pull/9400 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
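For context on the title: with partial progress enabled, rewrite_data_files commits finished file groups and logs the failure of an individual group instead of failing the whole job, which is the log message whose level the PR adjusts. A hedged example of turning the mode on, using the documented procedure options; the catalog name 'demo', the table 'db.events', and the max-commits value are placeholders:

    import org.apache.spark.sql.SparkSession;

    public class RewriteWithPartialProgress {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("rewrite-data-files").getOrCreate();

        // run compaction with partial progress so each completed group is committed separately
        spark.sql(
            "CALL demo.system.rewrite_data_files("
                + "table => 'db.events', "
                + "options => map("
                + "'partial-progress.enabled', 'true', "
                + "'partial-progress.max-commits', '10'))");

        spark.stop();
      }
    }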
Re: [PR] Spark: Request distribution and ordering for writes [iceberg]
maytasm commented on PR #3461: URL: https://github.com/apache/iceberg/pull/3461#issuecomment-1874846377 If, in Spark 3.3, users no longer have to explicitly sort their data before an INSERT into a partitioned table, should we insert a global sort into the plan rather than a local sort? https://github.com/apache/iceberg/pull/3461/files#diff-e13fee34cfc35da918a0a1a668758085888e87ba9eb36685121a4fad009fa649R74 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
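One way to frame the question above (a sketch, not part of the PR): the distribution Iceberg requests from Spark follows the table's write.distribution-mode property ('none', 'hash', or 'range'), where 'range' effectively yields a globally ordered write and the other modes fall back to clustering plus a task-local sort. The table and staging names below are placeholders:

    import org.apache.spark.sql.SparkSession;

    public class ConfigureWriteDistribution {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("write-distribution").getOrCreate();

        // request a range distribution so rows are ordered across tasks by partition columns
        spark.sql(
            "ALTER TABLE demo.db.events "
                + "SET TBLPROPERTIES ('write.distribution-mode' = 'range')");

        // subsequent inserts no longer need an explicit ORDER BY to cluster rows by partition
        spark.sql("INSERT INTO demo.db.events SELECT * FROM demo.db.events_staging");

        spark.stop();
      }
    }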
Re: [PR] Core: Suppress exceptions in case of dropTableData [iceberg]
nk1506 commented on PR #9184: URL: https://github.com/apache/iceberg/pull/9184#issuecomment-1874868435 @Fokko, a gentle reminder about this one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Ensure that partition stats files are considered for GC procedures [iceberg]
ajantha-bhat commented on PR #9284: URL: https://github.com/apache/iceberg/pull/9284#issuecomment-1874869959 ping @aokolnychyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org