[I] S3 compression Issue with Iceberg [iceberg]
swat1234 opened a new issue, #8713: URL: https://github.com/apache/iceberg/issues/8713

Iceberg tables are not compressing Parquet data files in S3. When the table properties below are used to enable compression, the file size actually increases compared to writing uncompressed. Can someone please assist?

1. File with UNCOMPRESSED codec: 0-0-0129ba78-17f6-466f-b57b-695c678d64d5-1.parquet, 682 bytes ("codec" : "UNCOMPRESSED")
2. File with GZIP codec: 0-0-e6f22c0e-2e16-43aa-8a5f-efabee995876-1.parquet, 733 bytes ("codec" : "GZIP")
3. File with SNAPPY codec: 0-0-36fd4aad-8c38-40f5-8241-78ffe4f0a032-1.parquet, 686 bytes ("codec" : "SNAPPY")

Table properties:
"parquet.compression": "SNAPPY"
"read.parquet.vectorization.batch-size": "5000"
"read.split.target-size": "134217728"
"read.parquet.vectorization.enabled": "true"
"write.parquet.page-size-bytes": "1048576"
"write.parquet.row-group-size-bytes": "134217728"
"write_compression": "SNAPPY"
"write.parquet.compression-codec": "snappy"
"write.metadata.metrics.max-inferred-column-defaults": "100"
"write.parquet.compression-level": "4"
"write.target-file-size-bytes": "536870912"
"write.delete.target-file-size-bytes": "67108864"
"write.parquet.page-row-limit": "2"
"write.format.default": "parquet"
"write.metadata.compression-codec": "gzip"
"write.compression": "SNAPPY"

Thanks in advance!!
Re: [PR] Core: Support view metadata compression [iceberg]
nastra commented on code in PR #8552: URL: https://github.com/apache/iceberg/pull/8552#discussion_r1345331616

core/src/test/java/org/apache/iceberg/view/TestViewMetadataParser.java (@@ -308,4 +322,57 @@ public void replaceViewMetadataWithMultipleSQLsForDialect() throws Exception {):

      assertThat(replaced.currentVersion()).isEqualTo(viewVersion);
    }

    @ParameterizedTest
    @ValueSource(strings = {"none", "gzip"})
    public void metadataCompression(String codecName) throws IOException {
      Codec codec = Codec.fromName(codecName);
      String location = Paths.get(tmp.toString(), "v1" + getFileExtension(codec)).toString();

Review Comment: I was aligning this test with how compression was tested for table metadata in https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/test/java/org/apache/iceberg/TableMetadataParserTest.java#L63 but I've changed it to have explicit file names
Re: [PR] API, Core: Allow setting a View's location [iceberg]
nastra commented on code in PR #8648: URL: https://github.com/apache/iceberg/pull/8648#discussion_r1345339487

api/src/main/java/org/apache/iceberg/view/UpdateViewLocation.java (new file, @@ -0,0 +1,32 @@; Apache license header omitted):

    package org.apache.iceberg.view;

    import org.apache.iceberg.PendingUpdate;

    /** API for setting a view's base location. */
    public interface UpdateViewLocation extends PendingUpdate {

Review Comment: yes, we could use `UpdateLocation` here instead of introducing a separate API
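For context, a minimal sketch of what reusing the existing `org.apache.iceberg.UpdateLocation` interface could look like from the caller's side. The `updateLocation()` accessor on `View` is an assumption here (mirroring `Table#updateLocation()`), not something confirmed by this thread:

```java
import org.apache.iceberg.UpdateLocation;
import org.apache.iceberg.view.View;

public class UpdateViewLocationExample {
  // Hypothetical usage: instead of a new UpdateViewLocation API, the view exposes
  // the existing UpdateLocation pending-update, mirroring Table#updateLocation().
  static void relocate(View view, String newLocation) {
    UpdateLocation update = view.updateLocation(); // assumed accessor on View
    update.setLocation(newLocation);               // stage the new base location
    update.commit();                               // atomically commit the metadata change
  }
}
```

Reusing the shared interface keeps the API surface smaller and lets generic tooling that already works with `UpdateLocation` handle views without changes.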
Re: [PR] API, Core: Allow setting a View's location [iceberg]
nastra commented on code in PR #8648: URL: https://github.com/apache/iceberg/pull/8648#discussion_r1345349415

core/src/main/java/org/apache/iceberg/view/SetViewLocation.java (new file, @@ -0,0 +1,80 @@; Apache license header omitted):

    package org.apache.iceberg.view;

    import static org.apache.iceberg.TableProperties.COMMIT_MAX_RETRY_WAIT_MS;
    import static org.apache.iceberg.TableProperties.COMMIT_MAX_RETRY_WAIT_MS_DEFAULT;
    import static org.apache.iceberg.TableProperties.COMMIT_MIN_RETRY_WAIT_MS;
    import static org.apache.iceberg.TableProperties.COMMIT_MIN_RETRY_WAIT_MS_DEFAULT;
    import static org.apache.iceberg.TableProperties.COMMIT_NUM_RETRIES;
    import static org.apache.iceberg.TableProperties.COMMIT_NUM_RETRIES_DEFAULT;
    import static org.apache.iceberg.TableProperties.COMMIT_TOTAL_RETRY_TIME_MS;
    import static org.apache.iceberg.TableProperties.COMMIT_TOTAL_RETRY_TIME_MS_DEFAULT;

    import org.apache.iceberg.exceptions.CommitFailedException;
    import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
    import org.apache.iceberg.util.PropertyUtil;
    import org.apache.iceberg.util.Tasks;

    class SetViewLocation implements UpdateViewLocation {
      private final ViewOperations ops;
      private ViewMetadata base;
      private String newLocation = null;

      SetViewLocation(ViewOperations ops) {
        this.ops = ops;
        this.base = ops.current();
      }

      @Override
      public String apply() {
        return internalApply().location();

Review Comment: makes sense, I've aligned the logic with how it's done for tables in `SetLocation`
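For readers unfamiliar with `SetLocation`, the table-side pattern the comment refers to is roughly: `apply()` only computes the updated metadata, while `commit()` wraps the metadata swap in the standard retry loop driven by the table's `commit.*` properties. A rough sketch of such a `commit()` body for the class excerpted above, under the assumption that `ViewOperations.commit(base, updated)` and `ViewMetadata.buildFrom(...).setLocation(...)` mirror their table counterparts (it uses only the constants and utilities already imported in the excerpt):

```java
// Sketch only: a commit() aligned with SetLocation's retry handling, inside SetViewLocation.
@Override
public void commit() {
  Tasks.foreach(ops)
      .retry(PropertyUtil.propertyAsInt(base.properties(), COMMIT_NUM_RETRIES, COMMIT_NUM_RETRIES_DEFAULT))
      .exponentialBackoff(
          PropertyUtil.propertyAsInt(base.properties(), COMMIT_MIN_RETRY_WAIT_MS, COMMIT_MIN_RETRY_WAIT_MS_DEFAULT),
          PropertyUtil.propertyAsInt(base.properties(), COMMIT_MAX_RETRY_WAIT_MS, COMMIT_MAX_RETRY_WAIT_MS_DEFAULT),
          PropertyUtil.propertyAsInt(base.properties(), COMMIT_TOTAL_RETRY_TIME_MS, COMMIT_TOTAL_RETRY_TIME_MS_DEFAULT),
          2.0 /* backoff scale factor */)
      .onlyRetryOn(CommitFailedException.class)
      .run(taskOps -> {
        // re-read the current metadata on every attempt and re-apply the location change
        this.base = taskOps.refresh();
        ViewMetadata updated = ViewMetadata.buildFrom(base).setLocation(newLocation).build();
        taskOps.commit(base, updated);
      });
}
```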
Re: [PR] API, Core: Allow setting a View's location [iceberg]
nastra commented on PR #8648: URL: https://github.com/apache/iceberg/pull/8648#issuecomment-1746330496 thanks for the reviews @rdblue and @amogh-jahagirdar, I've adjusted the code accordingly
Re: [I] Deprecate usages of AssertHelpers in codebase [iceberg]
gzagarwal commented on issue #7094: URL: https://github.com/apache/iceberg/issues/7094#issuecomment-1746338892

> okay let me pick, I am working on iceberg-aws

Shall I assume the current test cases are working? On my local system they are not, which is why I am asking. Is there any prerequisite for working in the Iceberg workspace, or some settings I need to apply before it builds?
Re: [I] S3 compression Issue with Iceberg [iceberg]
nastra commented on issue #8713: URL: https://github.com/apache/iceberg/issues/8713#issuecomment-1746347077

I see that you configured `"write.metadata.compression-codec": "gzip"`, but that property controls compression of table metadata files, not of individual data files. Also, is there any particular reason to set `parquet.compression` / `write.compression` / ... and all the others? The setting that controls data file compression is `write.parquet.compression-codec`, which defaults to `gzip`.
Re: [I] Deprecate usages of AssertHelpers in codebase [iceberg]
nastra commented on issue #7094: URL: https://github.com/apache/iceberg/issues/7094#issuecomment-1746356259 @gzagarwal yes the tests should all be working. What issue are you seeing? https://github.com/apache/iceberg/blob/a3aff95f9e60962240b94242e24a778760bdd1d9/CONTRIBUTING.md and https://iceberg.apache.org/contribute/#building-the-project-locally should contain everything needed in order to configure the project locally
Re: [I] S3 compression Issue with Iceberg [iceberg]
swat1234 commented on issue #8713: URL: https://github.com/apache/iceberg/issues/8713#issuecomment-1746360361

I am trying to reduce the storage space of the files by applying Snappy or Gzip compression. I can see that the metadata is getting compressed with gzip, but the data files are not. Could you guide me on how to do this?
Re: [I] S3 compression Issue with Iceberg [iceberg]
nastra commented on issue #8713: URL: https://github.com/apache/iceberg/issues/8713#issuecomment-1746404632

I would probably start by reducing the number of table properties being set. As I mentioned earlier, the one that matters in your case is `write.parquet.compression-codec`, which defaults to `gzip` but can also be set to `snappy` or `zstd`. The other setting you can experiment with is `write.parquet.compression-level`.
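As an illustration of that advice, a minimal sketch of setting only the data-file compression codec through the Java Table API. How the `Table` instance is loaded (Glue, Hive, REST, ...) is assumed; `TableProperties.PARQUET_COMPRESSION` is the constant for `write.parquet.compression-codec`:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

public class CompressionExample {
  // Sets only the property that controls Parquet data-file compression, leaving
  // the other write settings at their defaults. "table" is assumed to already be
  // loaded from whatever catalog is in use.
  static void useZstd(Table table) {
    table
        .updateProperties()
        .set(TableProperties.PARQUET_COMPRESSION, "zstd")     // write.parquet.compression-codec
        .set(TableProperties.PARQUET_COMPRESSION_LEVEL, "3")  // optional: write.parquet.compression-level
        .commit();
  }
}
```

Only newly written data files pick up the codec; files already in the table keep their existing compression until they are rewritten, for example by compaction.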
Re: [I] Upgrade to gradle 8.3 [iceberg]
jbonofre commented on issue #8485: URL: https://github.com/apache/iceberg/issues/8485#issuecomment-1746467781

FYI, I tested `revapi` with Gradle 8.3 (on my PR). Here's the test I did:

* I added a `void test();` method in `SessionCatalog`
* I added the corresponding `public void test() {}` in `RESTSessionCatalog`

So there's actually a change to the API. With Gradle 8.1, `revapi` fails with `API/ABI breaks detected.`. With Gradle 8.3, `revapi` doesn't fail; it doesn't detect the API change. I'm investigating.
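A minimal sketch of that probe, for readers who want to reproduce it. The real classes are `org.apache.iceberg.catalog.SessionCatalog` and `org.apache.iceberg.rest.RESTSessionCatalog`; the names below are stand-ins so the snippet is self-contained:

```java
// Illustration of the revapi probe: an interface gains a new method and an
// implementation adds the corresponding body, which is an API change that the
// revapi check is expected to flag.
public class RevapiProbeExample {
  interface SessionCatalogLike {
    // ... pre-existing API ...

    // newly added method: the deliberate API break used for the probe
    void test();
  }

  static class RestSessionCatalogLike implements SessionCatalogLike {
    @Override
    public void test() {}
  }
}
```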
Re: [PR] API, Core: Allow setting a View's location [iceberg]
nastra merged PR #8648: URL: https://github.com/apache/iceberg/pull/8648
Re: [I] Deprecate usages of AssertHelpers in codebase [iceberg]
gzagarwal commented on issue #7094: URL: https://github.com/apache/iceberg/issues/7094#issuecomment-1746492707

> @gzagarwal yes the tests should all be working. What issue are you seeing? https://github.com/apache/iceberg/blob/a3aff95f9e60962240b94242e24a778760bdd1d9/CONTRIBUTING.md and https://iceberg.apache.org/contribute/#building-the-project-locally should contain everything needed in order to configure the project locally

It's some issue with my local system only; let me fix that first. If someone else wants to take care of the changes in the meantime, they can pick this up. Once my workspace works locally, I will come back to it.
Re: [I] Upgrade to gradle 8.3 [iceberg]
ajantha-bhat commented on issue #8485: URL: https://github.com/apache/iceberg/issues/8485#issuecomment-1746502923

> With Gradle 8.3, revapi doesn't fail, it doesn't detect the API change.

Yes, that's what we observed with Gradle 8.2 as well. Maybe we need to raise an issue with the revapi team; if it turns out to be a Gradle issue, they will redirect us there.
Re: [PR] Flink: new sink base on the unified sink API - WIP [iceberg]
gyfora commented on code in PR #8653: URL: https://github.com/apache/iceberg/pull/8653#discussion_r1345500193 ## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java: ## @@ -18,72 +18,39 @@ */ package org.apache.iceberg.flink.sink; -import static org.apache.iceberg.TableProperties.AVRO_COMPRESSION; -import static org.apache.iceberg.TableProperties.AVRO_COMPRESSION_LEVEL; -import static org.apache.iceberg.TableProperties.ORC_COMPRESSION; -import static org.apache.iceberg.TableProperties.ORC_COMPRESSION_STRATEGY; -import static org.apache.iceberg.TableProperties.PARQUET_COMPRESSION; -import static org.apache.iceberg.TableProperties.PARQUET_COMPRESSION_LEVEL; -import static org.apache.iceberg.TableProperties.WRITE_DISTRIBUTION_MODE; - -import java.io.IOException; -import java.io.UncheckedIOException; -import java.time.Duration; import java.util.List; import java.util.Map; -import java.util.Set; -import java.util.function.Function; import org.apache.flink.api.common.functions.MapFunction; import org.apache.flink.api.common.typeinfo.TypeInformation; import org.apache.flink.api.common.typeinfo.Types; -import org.apache.flink.configuration.Configuration; -import org.apache.flink.configuration.ReadableConfig; import org.apache.flink.streaming.api.datastream.DataStream; import org.apache.flink.streaming.api.datastream.DataStreamSink; import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator; import org.apache.flink.streaming.api.functions.sink.DiscardingSink; import org.apache.flink.table.api.TableSchema; import org.apache.flink.table.data.RowData; -import org.apache.flink.table.data.util.DataFormatConverters; -import org.apache.flink.table.types.DataType; import org.apache.flink.table.types.logical.RowType; import org.apache.flink.types.Row; -import org.apache.iceberg.DataFile; -import org.apache.iceberg.DistributionMode; -import org.apache.iceberg.FileFormat; -import org.apache.iceberg.PartitionField; -import org.apache.iceberg.PartitionSpec; -import org.apache.iceberg.Schema; -import org.apache.iceberg.SerializableTable; import org.apache.iceberg.Table; -import org.apache.iceberg.flink.FlinkSchemaUtil; import org.apache.iceberg.flink.FlinkWriteConf; -import org.apache.iceberg.flink.FlinkWriteOptions; -import org.apache.iceberg.flink.TableLoader; -import org.apache.iceberg.flink.util.FlinkCompatibilityUtil; +import org.apache.iceberg.flink.sink.committer.IcebergFilesCommitter; +import org.apache.iceberg.flink.sink.writer.IcebergStreamWriter; import org.apache.iceberg.io.WriteResult; import org.apache.iceberg.relocated.com.google.common.annotations.VisibleForTesting; -import org.apache.iceberg.relocated.com.google.common.base.Preconditions; -import org.apache.iceberg.relocated.com.google.common.collect.Lists; -import org.apache.iceberg.relocated.com.google.common.collect.Maps; -import org.apache.iceberg.relocated.com.google.common.collect.Sets; -import org.apache.iceberg.types.TypeUtil; import org.apache.iceberg.util.SerializableSupplier; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -public class FlinkSink { - private static final Logger LOG = LoggerFactory.getLogger(FlinkSink.class); +public class FlinkSink extends SinkBase { Review Comment: Isn't this going to cause backward compatibility issues with the current FlinkSink instances? Maybe I am wrong but it means that users can only upgrade the iceberg version if they re-build their job graph (re-serialize operators etc.) 
## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/committer/CommonCommitter.java: ## @@ -0,0 +1,426 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink.committer; + +import java.io.IOException; +import java.io.Serializable; +import java.util.Arrays; +import java.util.Collection; +import java.util.List; +import java.util.Map; +import java.util.NavigableMap; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.TimeUnit; +import javax.annotation.Nullable; +import org.apache.flink.api.connector.sink2.Committer; +import org.apache.flink.core.io.SimpleVersionedSerialization; +import org
Re: [PR] Docs: Document all metadata tables. [iceberg]
nk1506 commented on code in PR #8709: URL: https://github.com/apache/iceberg/pull/8709#discussion_r1345504986

docs/spark-queries.md (@@ -357,6 +381,31 @@ SELECT * FROM prod.db.table.all_data_files;):

    | 0|s3://.../dt=20210103/0-0-26222098-032f-472b-8ea5-651a55b21210-1.parquet| PARQUET|{20210103}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210103}|{1 -> 3, 2 -> 20210103}|null| [4]|null|0|
    | 0|s3://.../dt=20210104/0-0-a3bb1927-88eb-4f1c-bc6e-19076b0d952e-1.parquet| PARQUET|{20210104}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210104}|{1 -> 3, 2 -> 20210104}|null| [4]|null|0|

    + All Delete Files
    +
    + To show all the table's delete files and each file's metadata:

Review Comment: I followed the same convention as the previous tables, like `files`.
Re: [PR] Docs: Document all metadata tables. [iceberg]
ajantha-bhat commented on code in PR #8709: URL: https://github.com/apache/iceberg/pull/8709#discussion_r1345506789

docs/spark-queries.md (@@ -357,6 +381,31 @@, same excerpt as quoted above):

Review Comment: OK, the existing convention is still confusing for me to read, but we can optimize it in a follow-up. I'm fine with keeping it similar for now.
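For context on the docs under review, metadata tables such as the new `all_delete_files` one are queried like any other Iceberg metadata table; a small sketch using the Spark DataFrame API from Java (the catalog and table name `prod.db.table` are just the placeholders used throughout these docs):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MetadataTableExample {
  // Reads the all_delete_files metadata table of an Iceberg table and prints the rows.
  static void showAllDeleteFiles(SparkSession spark) {
    Dataset<Row> deleteFiles =
        spark.read().format("iceberg").load("prod.db.table.all_delete_files");
    deleteFiles.show();
  }
}
```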
[PR] feat: In memory catalog [iceberg-rust]
JanKaul opened a new pull request, #74: URL: https://github.com/apache/iceberg-rust/pull/74 This is a draft PR to implement some functionality for an in memory catalog. The in memory catalog is supposed to simplify tests. Additionally this PR serves as a way to test the requirements for the `Catalog` trait for catalogs that make use of the `metadata_location` of an Iceberg Table.
Re: [I] Could there be duplicate values in the result returned by the findOrphanFiles method? [iceberg]
nk1506 commented on issue #8670: URL: https://github.com/apache/iceberg/issues/8670#issuecomment-1746511694 @RussellSpitzer, I would like to look into this and fix it accordingly.
Re: [I] Upgrade to gradle 8.3 [iceberg]
jbonofre commented on issue #8485: URL: https://github.com/apache/iceberg/issues/8485#issuecomment-1746517112

I think the problem is more on the Gradle side, or in the interaction between Gradle and the revapi Gradle plugin. I'm bisecting Gradle to identify the change causing the issue.
Re: [PR] Core: Add View support for REST catalog [iceberg]
nastra commented on code in PR #7913: URL: https://github.com/apache/iceberg/pull/7913#discussion_r1345527136

core/src/main/java/org/apache/iceberg/catalog/BaseSessionCatalog.java (@@ -30,8 +30,10 @@):

     import org.apache.iceberg.exceptions.NamespaceNotEmptyException;
     import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
     import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
    +import org.apache.iceberg.view.View;
    +import org.apache.iceberg.view.ViewBuilder;

    -public abstract class BaseSessionCatalog implements SessionCatalog {
    +public abstract class BaseSessionCatalog implements SessionCatalog, ViewSessionCatalog {

Review Comment: in this case, maybe we should have a `BaseViewSessionCatalog` that extends `BaseSessionCatalog`? This would be similar to `BaseMetastoreCatalog` & `BaseMetastoreViewCatalog`. Also I'm not sure why RevAPI didn't complain
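A minimal sketch of the hierarchy the comment is suggesting: `BaseSessionCatalog` itself stays unchanged, and only catalogs that support views extend the new subclass. Whether `ViewSessionCatalog` and the new base class end up in exactly this shape is an assumption, not something decided in this thread:

```java
package org.apache.iceberg.catalog;

// Sketch of the suggested intermediate base class. BaseSessionCatalog is left as-is,
// so table-only session catalogs see no API change; view-capable catalogs extend this
// class instead, mirroring how BaseMetastoreViewCatalog extends BaseMetastoreCatalog.
public abstract class BaseViewSessionCatalog extends BaseSessionCatalog
    implements ViewSessionCatalog {
  // view-specific plumbing (e.g. per-session View loading and ViewBuilder wiring)
  // would live here instead of on BaseSessionCatalog
}
```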
[I] Adopt `Catalog` API to include references to the `TableMetadata` and the `metadata_location` in the `TableCommit` payload for the `update_table` method [iceberg-rust]
JanKaul opened a new issue, #75: URL: https://github.com/apache/iceberg-rust/issues/75

Iceberg catalogs that make use of a `*.metadata.json` file to store the table metadata require the `metadata_location` and the `TableMetadata` of a table to perform an `update_table` operation ([see here](https://github.com/JanKaul/iceberg-rust/blob/memory-catalog/crates/iceberg/src/catalog/memory.rs#L176)). It would therefore be helpful to include references to the `metadata_location` and the `TableMetadata` in the `TableCommit` payload of the `update_table` operation. Something like:

```rust
pub struct TableCommit<'t> {
    /// The table ident.
    pub ident: TableIdent,
    /// Metadata file location of the table
    pub metadata_location: &'t str,
    /// Table metadata
    pub table_metadata: &'t TableMetadata,
    /// The requirements of the table.
    ///
    /// Commit will fail if the requirements are not met.
    pub requirements: Vec<TableRequirement>,
    /// The updates of the table.
    pub updates: Vec<TableUpdate>,
}
```
Re: [I] Optimize metadata tables? [iceberg]
ajantha-bhat commented on issue #8714: URL: https://github.com/apache/iceberg/issues/8714#issuecomment-1746561829 cc: @szehon-ho
Re: [I] Docs: document the compareWithFileList parameter [iceberg]
Tavisca-vinayak-bhadage commented on issue #8155: URL: https://github.com/apache/iceberg/issues/8155#issuecomment-1746581142

The compareWithFileList option would also be a good solution for AWS S3 based Iceberg tables, as we are hitting the exception below with the default remove-orphan-files implementation:

`org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown;)`

Is there any way to use the compareWithFileList parameter from PySpark or Spark SQL? We are using AWS Glue as the Spark engine, which does not support the Java Dataset API.
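For reference, this is roughly how the file-list comparison is wired up on the Java side. This is a hedged sketch: `compareToFileList` is, to my knowledge, the method name on the Spark delete-orphan-files action, and the expected `file_path`/`last_modified` column names as well as the table and listing setup are assumptions to verify against the javadoc of your Iceberg version:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrphanFileExample {
  // Instead of listing the S3 prefix through the FileIO (which can trigger SlowDown
  // throttling), hand the action a pre-built listing, e.g. derived from an S3 Inventory
  // report, with "file_path" and "last_modified" columns (assumed schema).
  static DeleteOrphanFiles.Result removeOrphans(SparkSession spark, Table table, Dataset<Row> listing) {
    return SparkActions.get(spark)
        .deleteOrphanFiles(table)
        .compareToFileList(listing) // compare against the provided listing instead of listing the prefix
        .olderThan(System.currentTimeMillis() - 3 * 24 * 60 * 60 * 1000L) // keep files newer than 3 days
        .execute();
  }
}
```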
Re: [I] Could there be duplicate values in the result returned by the findOrphanFiles method? [iceberg]
nk1506 commented on issue #8670: URL: https://github.com/apache/iceberg/issues/8670#issuecomment-1746623860 @hwfff, could you please share the stack trace if you have it handy?
Re: [I] S3 compression Issue with Iceberg [iceberg]
swat1234 commented on issue #8713: URL: https://github.com/apache/iceberg/issues/8713#issuecomment-1746699625

We tried with only the `write.parquet.compression-codec` parameter set, to both snappy and gzip, but it is not working. Instead of shrinking, the file size increases.
Re: [I] S3 compression Issue with Iceberg [iceberg]
RussellSpitzer commented on issue #8713: URL: https://github.com/apache/iceberg/issues/8713#issuecomment-1746708682

If you are only trying this with sub-kilobyte files, the results will be bad: there are fixed per-file costs to amortize, and most of such a file (the footer) is not compressed. Try with larger files.
Re: [PR] Flink: new sink base on the unified sink API - WIP [iceberg]
gyfora commented on code in PR #8653: URL: https://github.com/apache/iceberg/pull/8653#discussion_r1345697890 ## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/SinkBase.java: ## @@ -0,0 +1,326 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink; + +import static org.apache.iceberg.TableProperties.AVRO_COMPRESSION; +import static org.apache.iceberg.TableProperties.AVRO_COMPRESSION_LEVEL; +import static org.apache.iceberg.TableProperties.ORC_COMPRESSION; +import static org.apache.iceberg.TableProperties.ORC_COMPRESSION_STRATEGY; +import static org.apache.iceberg.TableProperties.PARQUET_COMPRESSION; +import static org.apache.iceberg.TableProperties.PARQUET_COMPRESSION_LEVEL; +import static org.apache.iceberg.TableProperties.WRITE_DISTRIBUTION_MODE; + +import java.io.IOException; +import java.io.Serializable; +import java.io.UncheckedIOException; +import java.time.Duration; +import java.util.List; +import java.util.Map; +import java.util.Set; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.table.api.TableSchema; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.types.logical.RowType; +import org.apache.iceberg.DistributionMode; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.PartitionField; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Schema; +import org.apache.iceberg.SerializableTable; +import org.apache.iceberg.Table; +import org.apache.iceberg.flink.FlinkSchemaUtil; +import org.apache.iceberg.flink.FlinkWriteConf; +import org.apache.iceberg.flink.TableLoader; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.relocated.com.google.common.collect.Sets; +import org.apache.iceberg.types.TypeUtil; +import org.apache.iceberg.util.SerializableSupplier; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class SinkBase implements Serializable { + private static final Logger LOG = LoggerFactory.getLogger(SinkBase.class); + + private transient SinkBuilder builder; + private List equalityFieldIds; + private RowType flinkRowType; + // Can not be serialized, so we have to be careful and serialize specific fields when needed + private transient FlinkWriteConf flinkWriteConf; + private Map writeProperties; + private SerializableSupplier tableSupplier; + + SinkBase(SinkBuilder builder) { +this.builder = builder; + } + + SinkBuilder builder() { +return builder; + } + + FlinkWriteConf flinkWriteConf() { +return 
flinkWriteConf;
  }

  Map writeProperties() {
    return writeProperties;
  }

  SerializableSupplier tableSupplier() {
    return tableSupplier;
  }

  RowType flinkRowType() {
    return flinkRowType;
  }

  List equalityFieldIds() {

Review Comment: Why not simply protected / package private fields and access them directly to reduce boilerplate?
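For illustration, the alternative the reviewer is hinting at would look roughly like the sketch below; the field names and types are illustrative, not the actual `SinkBase` members:

```java
import java.util.List;
import java.util.Map;

// Sketch of the suggestion: expose shared state as protected (or package-private)
// fields on the base class so subclasses read it directly, instead of generating a
// getter per field.
abstract class ExampleSinkBase {
  protected Map<String, String> writeProperties;
  protected List<Integer> equalityFieldIds;
}

class ExampleSink extends ExampleSinkBase {
  void logState() {
    // direct field access, no getter boilerplate
    System.out.println(writeProperties.size() + " properties, " + equalityFieldIds);
  }
}
```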
Re: [PR] Flink: new sink base on the unified sink API - WIP [iceberg]
gyfora commented on code in PR #8653: URL: https://github.com/apache/iceberg/pull/8653#discussion_r1345702939 ## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java: ## @@ -0,0 +1,276 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink; + +import java.util.Map; +import java.util.UUID; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.connector.sink2.Committer; +import org.apache.flink.core.io.SimpleVersionedSerializer; +import org.apache.flink.streaming.api.connector.sink2.CommittableMessage; +import org.apache.flink.streaming.api.connector.sink2.CommittableMessageTypeInfo; +import org.apache.flink.streaming.api.connector.sink2.WithPostCommitTopology; +import org.apache.flink.streaming.api.connector.sink2.WithPreCommitTopology; +import org.apache.flink.streaming.api.connector.sink2.WithPreWriteTopology; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.datastream.DataStreamSink; +import org.apache.flink.table.api.TableSchema; +import org.apache.flink.table.data.RowData; +import org.apache.flink.types.Row; +import org.apache.iceberg.DistributionMode; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Table; +import org.apache.iceberg.flink.TableLoader; +import org.apache.iceberg.flink.sink.committer.CommonCommitter; +import org.apache.iceberg.flink.sink.committer.SinkV2Aggregator; +import org.apache.iceberg.flink.sink.committer.SinkV2Committable; +import org.apache.iceberg.flink.sink.committer.SinkV2CommittableSerializer; +import org.apache.iceberg.flink.sink.committer.SinkV2Committer; +import org.apache.iceberg.flink.sink.writer.IcebergStreamWriterMetrics; +import org.apache.iceberg.flink.sink.writer.RowDataTaskWriterFactory; +import org.apache.iceberg.flink.sink.writer.SinkV2Writer; + +/** + * Flink v2 sink offer different hooks to insert custom topologies into the sink. 
We will use the + * following: + * + * + * {@link WithPreWriteTopology} which redistributes the data to the writers based on the + * {@link DistributionMode} + * {@link org.apache.flink.api.connector.sink2.SinkWriter} which writes data files, and + * generates a {@link SinkV2Committable} to store the {@link + * org.apache.iceberg.io.WriteResult} + * {@link WithPreCommitTopology} which we use to to place the {@link SinkV2Aggregator} which + * merges the individual {@link org.apache.flink.api.connector.sink2.SinkWriter}'s {@link + * org.apache.iceberg.io.WriteResult}s to a single {@link + * org.apache.iceberg.flink.sink.committer.DeltaManifests} + * {@link Committer} which stores the incoming {@link + * org.apache.iceberg.flink.sink.committer.DeltaManifests}s in state for recovery, and commits + * them to the Iceberg table using the {@link CommonCommitter} + * {@link WithPostCommitTopology} we could use for incremental compaction later. This is not + * implemented yet. + * + * + * The job graph looks like below: + * + * {@code + *Flink sink + * +---+ + * | | + * +---+ | +--+ +-+ +---+ | + * | Map 1 | ==> | | writer 1 | | committer 1 | ---> | post commit 1 | | + * +---+ | +--+ +-+ +---+ | + * | \ / \ | + * | \ / \ | + * | \ / \| + * +---+ | +--+ \ +---+ / +-+ \ +---+ | + * | Map 2 | ==> | | writer 2 | --->| commit aggregator | |
Re: [PR] Flink: new sink base on the unified sink API - WIP [iceberg]
gyfora commented on code in PR #8653: URL: https://github.com/apache/iceberg/pull/8653#discussion_r1345705474 ## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java: ## @@ -0,0 +1,276 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink; + +import java.util.Map; +import java.util.UUID; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.connector.sink2.Committer; +import org.apache.flink.core.io.SimpleVersionedSerializer; +import org.apache.flink.streaming.api.connector.sink2.CommittableMessage; +import org.apache.flink.streaming.api.connector.sink2.CommittableMessageTypeInfo; +import org.apache.flink.streaming.api.connector.sink2.WithPostCommitTopology; +import org.apache.flink.streaming.api.connector.sink2.WithPreCommitTopology; +import org.apache.flink.streaming.api.connector.sink2.WithPreWriteTopology; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.datastream.DataStreamSink; +import org.apache.flink.table.api.TableSchema; +import org.apache.flink.table.data.RowData; +import org.apache.flink.types.Row; +import org.apache.iceberg.DistributionMode; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Table; +import org.apache.iceberg.flink.TableLoader; +import org.apache.iceberg.flink.sink.committer.CommonCommitter; +import org.apache.iceberg.flink.sink.committer.SinkV2Aggregator; +import org.apache.iceberg.flink.sink.committer.SinkV2Committable; +import org.apache.iceberg.flink.sink.committer.SinkV2CommittableSerializer; +import org.apache.iceberg.flink.sink.committer.SinkV2Committer; +import org.apache.iceberg.flink.sink.writer.IcebergStreamWriterMetrics; +import org.apache.iceberg.flink.sink.writer.RowDataTaskWriterFactory; +import org.apache.iceberg.flink.sink.writer.SinkV2Writer; + +/** + * Flink v2 sink offer different hooks to insert custom topologies into the sink. 
We will use the + * following: + * + * + * {@link WithPreWriteTopology} which redistributes the data to the writers based on the + * {@link DistributionMode} + * {@link org.apache.flink.api.connector.sink2.SinkWriter} which writes data files, and + * generates a {@link SinkV2Committable} to store the {@link + * org.apache.iceberg.io.WriteResult} + * {@link WithPreCommitTopology} which we use to to place the {@link SinkV2Aggregator} which + * merges the individual {@link org.apache.flink.api.connector.sink2.SinkWriter}'s {@link + * org.apache.iceberg.io.WriteResult}s to a single {@link + * org.apache.iceberg.flink.sink.committer.DeltaManifests} + * {@link Committer} which stores the incoming {@link + * org.apache.iceberg.flink.sink.committer.DeltaManifests}s in state for recovery, and commits + * them to the Iceberg table using the {@link CommonCommitter} + * {@link WithPostCommitTopology} we could use for incremental compaction later. This is not + * implemented yet. + * + * + * The job graph looks like below: + * + * {@code + *Flink sink + * +---+ + * | | + * +---+ | +--+ +-+ +---+ | + * | Map 1 | ==> | | writer 1 | | committer 1 | ---> | post commit 1 | | + * +---+ | +--+ +-+ +---+ | + * | \ / \ | + * | \ / \ | + * | \ / \| + * +---+ | +--+ \ +---+ / +-+ \ +---+ | + * | Map 2 | ==> | | writer 2 | --->| commit aggregator | |
Re: [PR] Flink: new sink base on the unified sink API - WIP [iceberg]
gyfora commented on code in PR #8653: URL: https://github.com/apache/iceberg/pull/8653#discussion_r1345706466 ## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java: ## @@ -0,0 +1,276 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink; + +import java.util.Map; +import java.util.UUID; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.connector.sink2.Committer; +import org.apache.flink.core.io.SimpleVersionedSerializer; +import org.apache.flink.streaming.api.connector.sink2.CommittableMessage; +import org.apache.flink.streaming.api.connector.sink2.CommittableMessageTypeInfo; +import org.apache.flink.streaming.api.connector.sink2.WithPostCommitTopology; +import org.apache.flink.streaming.api.connector.sink2.WithPreCommitTopology; +import org.apache.flink.streaming.api.connector.sink2.WithPreWriteTopology; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.datastream.DataStreamSink; +import org.apache.flink.table.api.TableSchema; +import org.apache.flink.table.data.RowData; +import org.apache.flink.types.Row; +import org.apache.iceberg.DistributionMode; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Table; +import org.apache.iceberg.flink.TableLoader; +import org.apache.iceberg.flink.sink.committer.CommonCommitter; +import org.apache.iceberg.flink.sink.committer.SinkV2Aggregator; +import org.apache.iceberg.flink.sink.committer.SinkV2Committable; +import org.apache.iceberg.flink.sink.committer.SinkV2CommittableSerializer; +import org.apache.iceberg.flink.sink.committer.SinkV2Committer; +import org.apache.iceberg.flink.sink.writer.IcebergStreamWriterMetrics; +import org.apache.iceberg.flink.sink.writer.RowDataTaskWriterFactory; +import org.apache.iceberg.flink.sink.writer.SinkV2Writer; + +/** + * Flink v2 sink offer different hooks to insert custom topologies into the sink. 
We will use the + * following: + * + * + * {@link WithPreWriteTopology} which redistributes the data to the writers based on the + * {@link DistributionMode} + * {@link org.apache.flink.api.connector.sink2.SinkWriter} which writes data files, and + * generates a {@link SinkV2Committable} to store the {@link + * org.apache.iceberg.io.WriteResult} + * {@link WithPreCommitTopology} which we use to to place the {@link SinkV2Aggregator} which + * merges the individual {@link org.apache.flink.api.connector.sink2.SinkWriter}'s {@link + * org.apache.iceberg.io.WriteResult}s to a single {@link + * org.apache.iceberg.flink.sink.committer.DeltaManifests} + * {@link Committer} which stores the incoming {@link + * org.apache.iceberg.flink.sink.committer.DeltaManifests}s in state for recovery, and commits + * them to the Iceberg table using the {@link CommonCommitter} + * {@link WithPostCommitTopology} we could use for incremental compaction later. This is not + * implemented yet. + * + * + * The job graph looks like below: + * + * {@code + *Flink sink + * +---+ + * | | + * +---+ | +--+ +-+ +---+ | + * | Map 1 | ==> | | writer 1 | | committer 1 | ---> | post commit 1 | | + * +---+ | +--+ +-+ +---+ | + * | \ / \ | + * | \ / \ | + * | \ / \| + * +---+ | +--+ \ +---+ / +-+ \ +---+ | + * | Map 2 | ==> | | writer 2 | --->| commit aggregator | |
Re: [I] Writing to S3 fails if the user is authenticated with `aws sso login` [iceberg-python]
jayceslesar commented on issue #39: URL: https://github.com/apache/iceberg-python/issues/39#issuecomment-1746873989 This is confirmed to be an upstream bug in pyarrow 13.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] S3 compression Issue with Iceberg [iceberg]
amogh-jahagirdar commented on issue #8713: URL: https://github.com/apache/iceberg/issues/8713#issuecomment-1746920616 +1 to @RussellSpitzer's point. These files are far too small for compression to play a significant role. Compression is most noticeable on significant amounts of "similar" data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
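For illustration only (plain `java.util.zip`, nothing Iceberg- or Parquet-specific): every general-purpose codec adds a fixed header and trailer, so for payloads this small the container overhead can exceed any savings and the "compressed" output comes out larger than the input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class TinyPayloadCompression {
  public static void main(String[] args) throws Exception {
    // A very small payload, analogous to the handful of rows in the files above.
    byte[] tiny = "id,word\n1,a\n2,b\n".getBytes(StandardCharsets.UTF_8);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
      gzip.write(tiny);
    }

    // The gzip container alone adds about 18 bytes of header/trailer, so tiny
    // or incompressible inputs routinely come out larger than they went in.
    System.out.printf("raw=%d bytes, gzip=%d bytes%n", tiny.length, out.size());
  }
}
```

The same effect applies inside Parquet: with only a few rows, the file is mostly footer and column metadata, which the column codec cannot shrink.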
[PR] Construct a writer tree [iceberg-python]
Fokko opened a new pull request, #40: URL: https://github.com/apache/iceberg-python/pull/40 For V1 and V2 there are some differences that are hard to enforce without this: - `1: snapshot_id` is required for V1, optional for V2 - `105: block_size_in_bytes` needs to be written for V1, but omitted for V2 (this leverages the `write-default`). - `3: sequence_number` and `4: file_sequence_number` can be omitted for V1. Everything that we read is mapped to V2. However, when writing we also want to be compliant with the V1 spec, and this is where the writer tree comes in, since we construct a tree for either V1 or V2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
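For comparison, the Java library already makes this a construction-time decision: the manifest writer is chosen per format version, so version-specific fields are emitted or omitted by the writer rather than by the caller. A rough sketch of that API (the surrounding method and variable names are illustrative, not taken from this PR):

```java
import java.io.IOException;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestWriter;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.io.OutputFile;

// Sketch: the writer is picked per format version, so V1-only fields
// (e.g. block_size_in_bytes) and V2-only fields (sequence numbers) are
// handled by the chosen writer, not by the caller.
class ManifestWriteSketch {
  static ManifestFile writeManifest(
      int formatVersion, PartitionSpec spec, OutputFile out, Long snapshotId,
      Iterable<DataFile> files) throws IOException {
    ManifestWriter<DataFile> writer = ManifestFiles.write(formatVersion, spec, out, snapshotId);
    try {
      files.forEach(writer::add);
    } finally {
      writer.close();
    }
    return writer.toManifestFile();
  }
}
```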
Re: [PR] Construct a writer tree [iceberg-python]
Fokko commented on code in PR #40: URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1345991299 ## pyiceberg/manifest.py: ## @@ -262,15 +346,13 @@ class DataFile(Record): "split_offsets", "equality_ids", "sort_order_id", -"spec_id", ) content: DataFileContent file_path: str file_format: FileFormat partition: Record record_count: int file_size_in_bytes: int -block_size_in_bytes: Optional[int] Review Comment: I've removed this since we shouldn't use it. It is still part of the schema itself. ## pyiceberg/manifest.py: ## @@ -281,7 +363,6 @@ class DataFile(Record): split_offsets: Optional[List[int]] equality_ids: Optional[List[int]] sort_order_id: Optional[int] -spec_id: Optional[int] Review Comment: Removed this for now. Created an issue: https://github.com/apache/iceberg/issues/8712 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Your branch name [iceberg]
shreyanshR7 opened a new pull request, #8715: URL: https://github.com/apache/iceberg/pull/8715 #7154 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add View support for REST catalog [iceberg]
nastra commented on code in PR #7913: URL: https://github.com/apache/iceberg/pull/7913#discussion_r1346067694 ## core/src/test/java/org/apache/iceberg/rest/RESTCatalogAdapter.java: ## @@ -568,4 +649,9 @@ private static TableIdentifier identFromPathVars(Map pathVars) { return TableIdentifier.of( namespaceFromPathVars(pathVars), RESTUtil.decodeString(pathVars.get("table"))); } + + private static TableIdentifier viewIdentFromPathVars(Map pathVars) { Review Comment: yep, good idea -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add View support for REST catalog [iceberg]
nastra commented on code in PR #7913: URL: https://github.com/apache/iceberg/pull/7913#discussion_r1346082143 ## open-api/rest-catalog-open-api.yaml: ## @@ -1014,6 +1014,357 @@ paths: } } + /v1/{prefix}/namespaces/{namespace}/views: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + +get: + tags: +- Catalog API + summary: List all view identifiers underneath a given namespace + description: Return all view identifiers under this namespace + operationId: listViews + responses: +200: + $ref: '#/components/responses/ListTablesResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: Not Found - The namespace specified does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/ErrorModel' + examples: +NamespaceNotFound: + $ref: '#/components/examples/NoSuchNamespaceError' +419: + $ref: '#/components/responses/AuthenticationTimeoutResponse' +503: + $ref: '#/components/responses/ServiceUnavailableResponse' +5XX: + $ref: '#/components/responses/ServerErrorResponse' + +post: + tags: +- Catalog API + summary: Create a view in the given namespace + description: +Create a view in the given namespace. + operationId: createView + requestBody: +content: + application/json: +schema: + $ref: '#/components/schemas/CreateViewRequest' + responses: +200: + $ref: '#/components/responses/LoadViewResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: Not Found - The namespace specified does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/ErrorModel' + examples: +NamespaceNotFound: + $ref: '#/components/examples/NoSuchNamespaceError' +409: + description: Conflict - The view already exists + content: +application/json: + schema: +$ref: '#/components/schemas/ErrorModel' + examples: +NamespaceAlreadyExists: + $ref: '#/components/examples/ViewAlreadyExistsError' +419: + $ref: '#/components/responses/AuthenticationTimeoutResponse' +503: + $ref: '#/components/responses/ServiceUnavailableResponse' +5XX: + $ref: '#/components/responses/ServerErrorResponse' + + /v1/{prefix}/namespaces/{namespace}/views/{view}: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/view' + +get: + tags: +- Catalog API + summary: Load a view from the catalog + operationId: loadView + description: +Load a view from the catalog. + + +The response contains both configuration and table metadata. The configuration, if non-empty is used +as additional configuration for the view that overrides catalog configuration. For example, this +configuration may change the FileIO implementation to be used for the view. + + +The response also contains the view's full metadata, matching the view metadata JSON file. + + +The catalog configuration may contain credentials that should be used for subsequent requests for the +view. The configuration key "token" is used to pass an access token to be used as a bearer token +for view requests. Otherwise, a token may be passed using a RFC 8693 token type as a configuration +key. For example, "urn:ietf:params:oauth:token-type:jwt=". 
+ responses: +200: + $ref: '#/components/responses/LoadViewResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: +Not Found - NoSuchViewException, view to load does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/ErrorModel' + examples: +ViewToLoadDoesNotExist: +
[PR] OpenAPI: Add AssignUUID update to metadata updates [iceberg]
nastra opened a new pull request, #8716: URL: https://github.com/apache/iceberg/pull/8716 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] WIP: Write support [iceberg-python]
Fokko opened a new pull request, #41: URL: https://github.com/apache/iceberg-python/pull/41 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Replace Thread.sleep() usage in test code with Awaitility [iceberg]
shreyanshR7 commented on issue #7154: URL: https://github.com/apache/iceberg/issues/7154#issuecomment-1747275843 @nastra I tried to implement the above method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Thread.sleep() method is replaced with Awaitility [iceberg]
nk1506 commented on code in PR #8715: URL: https://github.com/apache/iceberg/pull/8715#discussion_r1346198636 ## api/src/test/java/org/apache/iceberg/metrics/TestDefaultTimer.java: ## @@ -104,7 +106,7 @@ public void measureRunnable() { Runnable runnable = () -> { try { -Thread.sleep(100); +Awaitility.await().atLeast(Duration.ofMillis(100)).until(() -> true); Review Comment: Won't it become flaky if the callable condition is evaluated sooner? `if (evaluationDuration.compareTo(minWaitTime) < 0) { String message = String.format("Condition was evaluated in %s which is earlier than expected minimum timeout %s", formatAsString(evaluationDuration), formatAsString(minWaitTime)); conditionEvaluationHandler.handleTimeout(message, true); throw new ConditionTimeoutException(message); }` Why not use something like `https://github.com/apache/iceberg/blob/master/api/src/test/java/org/apache/iceberg/TestHelpers.java#L57`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Thread.sleep() method is replaced with Awaitility [iceberg]
shreyanshR7 commented on PR #8715: URL: https://github.com/apache/iceberg/pull/8715#issuecomment-1747371647 Oh, I see: the code uses a while loop that checks the current time until the condition is met. But since the issue asks to replace the Thread.sleep method with Awaitility, should I implement your suggestion even though it doesn't use Awaitility? @nk1506 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Parquet: Support filter operations on int96 timestamps [iceberg]
thesquelched closed pull request #2563: Parquet: Support filter operations on int96 timestamps URL: https://github.com/apache/iceberg/pull/2563 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Push filtering for Iceberg table type to Hive MetaStore when listing tables [iceberg]
mderoy commented on PR #2722: URL: https://github.com/apache/iceberg/pull/2722#issuecomment-1747635549 @hankfanchiu this is awesome. The performance improvement from doing this is dramatic... any chance of reviving this? Or did the external issues put a damper on it? Maybe we can do this via catalog properties, like the 'list-all-tables' property, as a workaround if we're hung up on the external issues :thinking: If you don't have plans to move this forward, I can give it a go :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Phase 1 - New Docs Deployment [iceberg]
aokolnychyi commented on code in PR #8659: URL: https://github.com/apache/iceberg/pull/8659#discussion_r1346544156 ## docs-new/site/releases.md: ## @@ -0,0 +1,777 @@ +--- +title: "Releases" +--- + + +## Downloads + +The latest version of Iceberg is [{{ icebergVersion }}](https://github.com/apache/iceberg/releases/tag/apache-iceberg-{{ icebergVersion }}). + +* [{{ icebergVersion }} source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{ icebergVersion }}/apache-iceberg-{{ icebergVersion }}.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{ icebergVersion }}/apache-iceberg-{{ icebergVersion }}.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-{{ icebergVersion }}/apache-iceberg-{{ icebergVersion }}.tar.gz.sha512) +* [{{ icebergVersion }} Spark 3.4\_2.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.4_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.4_2.12-{{ icebergVersion }}.jar) -- [3.4\_2.13](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.4_2.13/{{ icebergVersion }}/iceberg-spark-runtime-3.4_2.13-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Spark 3.3\_2.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.3_2.12-{{ icebergVersion }}.jar) -- [3.3\_2.13](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.3_2.13/{{ icebergVersion }}/iceberg-spark-runtime-3.3_2.13-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Spark 3.2\_2.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.2_2.12-{{ icebergVersion }}.jar) -- [3.2\_2.13](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.13/{{ icebergVersion }}/iceberg-spark-runtime-3.2_2.13-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Spark 3.1 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.1_2.12-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Flink 1.17 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.17/{{ icebergVersion }}/iceberg-flink-runtime-1.17-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Flink 1.16 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.16/{{ icebergVersion }}/iceberg-flink-runtime-1.16-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Flink 1.15 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.15/{{ icebergVersion }}/iceberg-flink-runtime-1.15-{{ icebergVersion }}.jar) +* [{{ icebergVersion }} Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{ icebergVersion }}/iceberg-hive-runtime-{{ icebergVersion }}.jar) + +To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation. + +To use Iceberg in Hive 2 or Hive 3, download the Hive runtime JAR and add it to Hive using `ADD JAR`. 
+ +### Gradle + +To add a dependency on Iceberg in Gradle, add the following to `build.gradle`: + +``` +dependencies { + compile 'org.apache.iceberg:iceberg-core:{{ icebergVersion }}' +} +``` + +You may also want to include `iceberg-parquet` for Parquet file support. + +### Maven + +To add a dependency on Iceberg in Maven, add the following to your `pom.xml`: + +``` + + ... + +org.apache.iceberg +iceberg-core +{{ icebergVersion }} + + ... + +``` + +## 1.3.1 release + +Apache Iceberg 1.3.1 was released on July 25, 2023. +The 1.3.1 release addresses various issues identified in the 1.3.0 release. + +* Core + - Table Metadata parser now accepts null for fields: current-snapshot-id, properties, and snapshots ([\#8064](https://github.com/apache/iceberg/pull/8064)) +* Hive + - Fix HiveCatalog deleting metadata on failures in checking lock status ([\#7931](https://github.com/apache/iceberg/pull/7931)) +* Spark + - Fix RewritePositionDeleteFiles failure for certain partition types ([\#8059](https://github.com/apache/iceberg/pull/8059)) + - Fix RewriteDataFiles concurrency edge-case on commit timeouts ([\#7933](https://github.com/apache/iceberg/pull/7933)) + - Fix partition-level DELETE operations for WAP branches ([\#7900](https://github.com/apache/iceberg/pull/7900)) +* Flink + - FlinkCatalog creation no longer creates the default database ([\#8039](https://github.com/apache/iceberg/pull/8039)) + +## Past releases + +### 1.3.0 release + +Apache Iceberg 1.3.0 was released on May 30th, 2023. +The 1.3.0 release adds a vari
Re: [PR] Phase 1 - New Docs Deployment [iceberg]
aokolnychyi commented on PR #8659: URL: https://github.com/apache/iceberg/pull/8659#issuecomment-1747740766 My primary concern of moving the docs into the main repo was versioning and pollution. It seems like `git worktree` should solve that. I deployed this locally, it seems pretty straightforward. Do we need to adjust our `.gitignore`? I did see some extra files when deploying locally. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Phase 1 - New Docs Deployment [iceberg]
bitsondatadev commented on PR #8659: URL: https://github.com/apache/iceberg/pull/8659#issuecomment-1747804852 > My primary concern of moving the docs into the main repo was versioning and pollution. It seems like `git worktree` should solve that. I deployed this locally, it seems pretty straightforward. > > > > Do we need to adjust our `.gitignore`? I did see some extra files when deploying locally. There's still a few changes I need to make that I missed when adapting this from my sandbox repo. My goal is to not have any artifacts like this remaining once this is merged. I want this to be a good baseline to generate the older versions of the docs with the consistent mkdocs format. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Wap branch does not support reading from the partitions table [iceberg]
github-actions[bot] commented on issue #7297: URL: https://github.com/apache/iceberg/issues/7297#issuecomment-1747821140 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Upsert support for keyless Apache Flink tables [iceberg]
Ge opened a new issue, #8719: URL: https://github.com/apache/iceberg/issues/8719 ### Feature Request / Improvement Consider the following continuous insertion into a keyless table: ``` SET 'execution.checkpointing.interval' = '10 s'; SET 'sql-client.execution.result-mode' = 'tableau'; SET 'pipeline.max-parallelism' = '5'; CREATE CATALOG nessie_catalog WITH ( 'type'='iceberg', 'catalog-impl'='org.apache.iceberg.nessie.NessieCatalog', 'io-impl'='org.apache.iceberg.aws.s3.S3FileIO', 'uri'='http://catalog:19120/api/v1', 'authentication.type'='none', 'client.assume-role.region'='us-east-1', 'warehouse' = 's3://warehouse', 's3.endpoint'='http://127.0.0.1:9000' ); USE CATALOG nessie_catalog; CREATE DATABASE IF NOT EXISTS db; USE db; CREATE TEMPORARY TABLE word_table ( word STRING ) WITH ( 'connector' = 'datagen', 'fields.word.length' = '1' ); CREATE TABLE word_count ( word STRING, cnt BIGINT ) PARTITIONED BY (`ptn`) WITH ( 'format-version'='2', 'write.upsert.enabled'='true' ); INSERT INTO word_count SELECT word, COUNT(*) FROM word_table GROUP BY word; ``` The output of the `SELECT` query there looks like: ``` Flink SQL> SELECT word, COUNT(*) as cnt FROM word_table GROUP BY word LIMIT 5; +++--+ | op | word | cnt | +++--+ | +I | e |1 | | -D | e |1 | | +I | e |2 | | +I | 7 |1 | | +I | 5 |1 | | -D | e |2 | ... ``` Now, consider the following query against the `word_count` table: ``` Flink SQL> SELECT * FROM word_count LIMIT 10; ``` Expected result: latest counts for 10 words. Actual result: ``` [ERROR] Could not execute SQL statement. Reason: java.lang.IllegalStateException: Equality field columns shouldn't be empty when configuring to use UPSERT data stream. ``` ### Query engine Flink -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
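For reference, the Java DataStream path makes the same requirement explicit: upsert mode needs equality field columns, which is what a `PRIMARY KEY ... NOT ENFORCED` declaration provides on the SQL side. A minimal sketch, with the stream and table loader assumed to exist (not taken from the report):

```java
import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

// Sketch: upsert writes need equality fields so the writer can emit equality
// deletes for replaced rows; leaving them out triggers the same
// "Equality field columns shouldn't be empty" error seen above.
class UpsertSinkSketch {
  static void build(DataStream<RowData> rows, TableLoader tableLoader) {
    FlinkSink.forRowData(rows)
        .tableLoader(tableLoader)
        .upsert(true)
        .equalityFieldColumns(Arrays.asList("word"))
        .append();
  }
}
```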
[I] Null support in Apache Flink [iceberg]
Ge opened a new issue, #8720: URL: https://github.com/apache/iceberg/issues/8720 ### Query engine Flink 1.17.1 ### Question According to https://iceberg.apache.org/docs/latest/flink/#flink-to-iceberg, Iceberg does not handle Flink's `null`. Can you please describe the cases where the `null` support is missing or limited? I tried a very basic example - inserted a row without specifying values for all columns and it worked: ``` CREATE TABLE word_count ( word STRING, cnt BIGINT, PRIMARY KEY(`word`) NOT ENFORCED ) PARTITIONED BY (word) WITH ( 'format-version'='2', 'write.upsert.enabled'='true' ); INSERT INTO word_count (word) VALUES ('zz'); SELECT * FROM word_count WHERE word = 'zz'; +++--+ | op | word | cnt | +++--+ | +I | zz || +++--+ Received a total of 1 row ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Thread.sleep() method is replaced with Awaitility [iceberg]
nk1506 commented on code in PR #8715: URL: https://github.com/apache/iceberg/pull/8715#discussion_r1346743526 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceContinuous.java: ## @@ -325,7 +328,7 @@ public void testSpecificSnapshotTimestamp() throws Exception { long snapshot0Timestamp = tableResource.table().currentSnapshot().timestampMillis(); // sleep for 2 ms to make sure snapshot1 has a higher timestamp value -Thread.sleep(2); +Awaitility.await().atLeast(Duration.ofMillis(2)).until(() -> true); Review Comment: If the thread just needs to sleep for some duration, ideally it should be something like `Awaitility.await().pollDelay(2, TimeUnit.MILLISECONDS).until(() -> System.currentTimeMillis()-snapshot0Timestamp>2);` ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceContinuous.java: ## @@ -403,7 +406,7 @@ public static List waitForResult(CloseableIterator iter, int limit) { public static void waitUntilJobIsRunning(ClusterClient client) throws Exception { Review Comment: Remove this method; instead it should be something like `Awaitility.await().pollDelay(10, TimeUnit.MILLISECONDS).until(() -> !getRunningJobs(MINI_CLUSTER_RESOURCE.getClusterClient()).isEmpty());` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs: Document all metadata tables. [iceberg]
nk1506 commented on PR #8709: URL: https://github.com/apache/iceberg/pull/8709#issuecomment-1747996344 @szehon-ho, please review and share your feedback. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs: Document publish_changes procedure [iceberg]
nk1506 commented on PR #8706: URL: https://github.com/apache/iceberg/pull/8706#issuecomment-1747996641 @nastra, please take a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Thread.sleep() method is replaced with Awaitility [iceberg]
shreyanshR7 commented on PR #8715: URL: https://github.com/apache/iceberg/pull/8715#issuecomment-1748006574 Thanks for the suggestion, @nk1506. I'll update that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: add types timstamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on PR #8683: URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1748044374 @Fokko - friendly reminder to review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Thread.sleep() method is replaced with Awaitility [iceberg]
shreyanshR7 commented on PR #8715: URL: https://github.com/apache/iceberg/pull/8715#issuecomment-1748046446 I've made the changes, @nk1506. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Push filtering for Iceberg table type to Hive MetaStore when listing tables [iceberg]
pvary commented on PR #2722: URL: https://github.com/apache/iceberg/pull/2722#issuecomment-1748085546 @mderoy: I am still here to review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support deletion in Apache Flink [iceberg]
pvary commented on issue #8718: URL: https://github.com/apache/iceberg/issues/8718#issuecomment-1748095935 Is this for a V2 table? I have seen deleting rows work on a V2 table using Java code with the stream API, but I have yet to try SQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support deletion in Apache Flink [iceberg]
Ge commented on issue #8718: URL: https://github.com/apache/iceberg/issues/8718#issuecomment-1748157735 Yes, this is a V2 table. I added the DDL to the description now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs: Document publish_changes procedure [iceberg]
nastra merged PR #8706: URL: https://github.com/apache/iceberg/pull/8706 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Spark SQL Extensions: Document all stored procedures [iceberg]
nastra closed issue #1601: Spark SQL Extensions: Document all stored procedures URL: https://github.com/apache/iceberg/issues/1601 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Dell : Migrate Files using TestRule to Junit5. [iceberg]
ashutosh-roy commented on code in PR #8707: URL: https://github.com/apache/iceberg/pull/8707#discussion_r1346869623 ## dell/src/test/java/org/apache/iceberg/dell/mock/ecs/EcsS3MockRule.java: ## @@ -178,4 +163,16 @@ public String bucket() { public String randomObjectName() { return "test-" + ID.getAndIncrement() + "-" + UUID.randomUUID(); } + + @Override + public void afterEach(ExtensionContext extensionContext) { +System.out.println("Bucket removed"); +cleanUp(); + } + + @Override + public void beforeEach(ExtensionContext extensionContext) { +System.out.println("Bucket removed"); Review Comment: @nk1506 If there is no further feedback, are we good to merge this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Thread.sleep() method is replaced with Awaitility [iceberg]
nastra commented on PR #8715: URL: https://github.com/apache/iceberg/pull/8715#issuecomment-1748190639 The goal of https://github.com/apache/iceberg/issues/7154 is to convert `Thread.sleep` usages to Awaitility where it makes sense. We don't want to blindly just replace all `Thread.sleep` usage. We need to understand why the sleep is used in the first place. Typically, `Thread.sleep` is used to wait for an async condition to happen. Rather than waiting the max sleep time, we want to wait less than that, by checking a success condition that indicates we can proceed. Here is a good example where the test was made more stable by the use of Awaitility: https://github.com/apache/iceberg/blob/cff2ff9d2f533c8e733324a66f3101eafbc2e932/core/src/test/java/org/apache/iceberg/actions/TestCommitService.java#L99-L102 We wait at most 5 seconds until the given assertion is met. For this PR that means a good starting point might be https://github.com/apache/iceberg/blob/f9f85e7f2647020ef7e9571f76027c8e55075690/flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamingMonitorFunction.java#L153, where there's a 1s sleep and then a check. That's the kind of things we want to replace with Awaitility -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
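As a concrete illustration of that pattern (a sketch with placeholder names, not code from the linked tests):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import org.awaitility.Awaitility;

class AwaitilitySleepReplacement {
  // Before: Thread.sleep(1000); assertThat(counter.get()).isEqualTo(10);
  // After: poll the assertion and return as soon as it passes, giving up after 5 seconds.
  void waitForAsyncCount(AtomicInteger counter) {
    Awaitility.await()
        .atMost(Duration.ofSeconds(5))
        .untilAsserted(() -> assertThat(counter.get()).isEqualTo(10));
  }
}
```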
[I] Spark fails to write into an iceberg table after updating its schema [iceberg]
paulpaul1076 opened a new issue, #8721: URL: https://github.com/apache/iceberg/issues/8721 ### Apache Iceberg version 1.3.1 (latest release) ### Query engine Spark ### Please describe the bug 🐞 Spark fails to write the dataframe with new schema after updating the schema of a table: ``` import org.apache.iceberg.{CatalogUtil, Schema} import org.apache.iceberg.catalog.{Catalog, TableIdentifier} import org.apache.iceberg.types.Types import org.apache.spark.sql.SparkSession import java.util.Properties //https://iceberg.apache.org/docs/latest/nessie/ object IcebergJobNessie extends App { val spark = SparkSession.builder() .master("local[*]") .appName("iceberg test") .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog") .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions") .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") .config("spark.sql.catalog.nessie.ref", "main") .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1";) .config("spark.sql.catalog.nessie.s3.endpoint", "***") .config("spark.sql.catalog.nessie.s3.access.key", "***") .config("spark.sql.catalog.nessie.s3.secret.key", "***") .config("spark.sql.defaultCatalog", "nessie") .config("spark.hadoop.fs.s3a.endpoint", "***") .config("spark.hadoop.fs.s3a.access.key", "***") .config("spark.hadoop.fs.s3a.secret.key", "***") .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") .config("spark.sql.catalog.nessie.s3a.path-style-access ", "true") .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog") .config("spark.sql.catalog.nessie.warehouse", "s3://hdp-temp/iceberg_catalog") .getOrCreate() import spark.implicits._ val options = new java.util.HashMap[String, String]() options.put("warehouse", "s3://hdp-temp/iceberg_catalog") options.put("ref", "main") options.put("uri", "http://localhost:19120/api/v1";) val nessieCatalog: Catalog = CatalogUtil.loadCatalog("org.apache.iceberg.nessie.NessieCatalog", "nessie", options, spark.sparkContext.hadoopConfiguration) // -PART 1--- val name = TableIdentifier.of("db_nessie", "schema_evolution15") val schema = new Schema( Types.NestedField.required(1, "age", Types.IntegerType.get()), Types.NestedField.optional(2, "sibling_info", Types.ListType.ofOptional(3, Types.StructType.of( Types.NestedField.required(4, "age", Types.IntegerType.get()), Types.NestedField.optional(5, "name", Types.StringType.get()) )) ) ) nessieCatalog.createTable(name, schema) val df = List( (1, List( SiblingInfo(1, "John"), SiblingInfo(2, "Sean"), SiblingInfo(3, "Peter")) ), (12, List( SiblingInfo(13, "Ivan"), SiblingInfo(11, "Sean") ) )).toDF("age", "sibling_info") df.writeTo("db_nessie.schema_evolution15").append() spark.sql("select * from db_nessie.schema_evolution15").show(false) val table = nessieCatalog.loadTable(name) val newIcebergSchema = new Schema( Types.NestedField.required(1, "age", Types.IntegerType.get()), Types.NestedField.optional(2, "sibling_info", Types.ListType.ofOptional(3, Types.StructType.of( Types.NestedField.required(4, "age", Types.IntegerType.get()), Types.NestedField.optional(5, "name", Types.StringType.get()), Types.NestedField.optional(6, "lastName", Types.StringType.get()) )) ) ) table.updateSchema() .unionByNameWith(newIcebergSchema) .commit() // -PART 2--- val df2 = List( (1, List( 
SiblingInfo2(1, "John", "Johnson"), SiblingInfo2(2, "Sean", "Johnson"), SiblingInfo2(3, "Peter", "Johnson")) ), (12, List( SiblingInfo2(13, "Ivan", "Johnson"), SiblingInfo2(11, "Test", "Johnson") ) )).toDF("age", "sibling_info") df2.writeTo("db_nessie.schema_evolution15").append() spark.sql("select * from db_nessie.schema_evolution15").show(false) } ``` The exception is: ``` Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'spark_catalog1.db.schema_evolution15': - Cannot write nullable values to non-null column 'sibling_info.x.age'. at org.apache.spark.