[GitHub] [iceberg-docs] quantsegu opened a new pull request, #192: Adjusted on comments for adding IOMETE as a vendor
quantsegu opened a new pull request, #192: URL: https://github.com/apache/iceberg-docs/pull/192 The small textual adjustments have been made. Could you please finalize this?
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5984: Core, API: Support incremental scanning with branch
hililiwei commented on code in PR #5984: URL: https://github.com/apache/iceberg/pull/5984#discussion_r1056798443 ## api/src/main/java/org/apache/iceberg/IncrementalScan.java: ## @@ -21,6 +21,23 @@ /** API for configuring an incremental scan. */ public interface IncrementalScan> extends Scan { + + /** + * Instructs this scan to look for changes starting from a particular snapshot (inclusive). + * + * If the start snapshot is not configured, it is defaulted to the oldest ancestor of the end + * snapshot (inclusive). + * + * @param fromSnapshotId the start snapshot ID (inclusive) + * @param referenceName the ref used + * @return this for method chaining + * @throws IllegalArgumentException if the start snapshot is not an ancestor of the end snapshot + */ + default ThisT fromSnapshotInclusive(long fromSnapshotId, String referenceName) { Review Comment: Agree with @stevenzwu. Yes, a tag is a fixed point in time, but when using it for an incremental read, we can treat it semantically the same as `fromSnapshot(Long snapshotId)`. Just like @stevenzwu's example: I have daily tags (`20220101`, `20220102`). If I want to read the incremental data from `20220102` up to the current snapshot, I can use `fromSnapshotExclusive("20220102")`: ``` table.newIncrementalScan() .fromSnapshotExclusive("20220102") .planTasks() ``` Another way is to use the snapshot time to find the snapshot ID first, but sometimes that doesn't work. For example, we may generate tags based on the event time of the data, or we may tag the snapshot only after the application has completed. The application may finish at 3:00 on 2022/01/02 and tag the newly generated snapshot as `20220102`. If we use the snapshot time `2022-01-02 00:00:00` to find the snapshot ID, incorrect incremental data will be returned. cc @rdblue
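To make the tag-based incremental read above concrete, here is a minimal sketch using only the existing `long`-based API: it resolves the tag to its snapshot ID first, rather than using the String overloads proposed in this PR. The table variable and the tag name `20220102` are illustrative, and the sketch assumes the tag exists.

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.IncrementalAppendScan;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class IncrementalFromTagSketch {
  // Reads everything appended after the snapshot tagged "20220102".
  // Assumes `table` is already loaded and the tag exists (no null check for brevity).
  public static void printAddedFiles(Table table) throws Exception {
    long taggedSnapshotId = table.refs().get("20220102").snapshotId();

    IncrementalAppendScan scan =
        table.newIncrementalAppendScan().fromSnapshotExclusive(taggedSnapshotId);

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      tasks.forEach(task -> System.out.println(task.file().path()));
    }
  }
}
```

With the overloads proposed in the PR, the tag-to-snapshot resolution step would move inside the scan itself, which is what makes the String-based variant attractive for daily-tag workflows.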
[GitHub] [iceberg] hililiwei commented on pull request #6253: Flink: Write watermark to the snapshot summary
hililiwei commented on PR #6253: URL: https://github.com/apache/iceberg/pull/6253#issuecomment-1364513120 > should we write the metadata as snapshot summary or table properties? It is minimal change to write as snapshot summary as shown in this PR or PR https://github.com/apache/iceberg/pull/2109. It is a little bigger change (like using transaction) to write as a table property, but it will be easier for consumer to extract the info. I think these are two different aspects with different meanings. I include it in the summary because it does not represent the watermark for the table; it simply represents the watermark of the Flink task that generated the snapshot.
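The practical difference between the two options can be sketched as follows from a consumer's point of view. The key name `flink.watermark` is purely illustrative and is not necessarily the key introduced by this PR.

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class WatermarkLookupSketch {
  public static void show(Table table) {
    Snapshot snapshot = table.currentSnapshot();

    // Snapshot summary: scoped to the single commit produced by one Flink checkpoint.
    String perSnapshotWatermark = snapshot.summary().get("flink.watermark");

    // Table property: a single table-wide value that every writer would have to keep updated.
    String tableWideWatermark = table.properties().get("flink.watermark");

    System.out.println(perSnapshotWatermark + " / " + tableWideWatermark);
  }
}
```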
[GitHub] [iceberg] hililiwei commented on pull request #6160: Flink: Support locality with LocalitySplitAssigner
hililiwei commented on PR #6160: URL: https://github.com/apache/iceberg/pull/6160#issuecomment-1364514213 @stevenzwu @openinx @rdblue @Fokko @pvary could you please take a look at it when you get a chance? thx.
[GitHub] [iceberg-docs] RussellSpitzer merged pull request #192: Adjusted on comments for adding IOMETE as a vendor
RussellSpitzer merged PR #192: URL: https://github.com/apache/iceberg-docs/pull/192
[GitHub] [iceberg] ConeyLiu commented on pull request #3249: Optimized spark vectorized read parquet decimal
ConeyLiu commented on PR #3249: URL: https://github.com/apache/iceberg/pull/3249#issuecomment-1364527752 @nastra just addressed the comments.
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852402 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java: ## @@ -356,4 +357,23 @@ private GenericArrayData fillArray( return new GenericArrayData(array); } + + @Override + public boolean equals(Object other) { Review Comment: It is a bug in Spark that will be fixed. Spark uses `groupBy` in one place, which relies on `equals`. All other places use `InternalRowSet`.
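To make the change under discussion concrete, below is a simplified, self-contained sketch of struct-delegating equality. The wrapper class name and field are illustrative; this is not the exact code merged in #6371.

```java
import org.apache.iceberg.StructLike;

// Illustrative wrapper: equality and hashing are delegated to the wrapped Iceberg
// struct so that two rows wrapping equal structs compare as equal (e.g. in groupBy).
class StructRowWrapper {
  private final StructLike struct;

  StructRowWrapper(StructLike struct) {
    this.struct = struct;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    } else if (other == null || getClass() != other.getClass()) {
      return false;
    }
    StructRowWrapper that = (StructRowWrapper) other;
    return struct.equals(that.struct);
  }

  @Override
  public int hashCode() {
    return struct.hashCode();
  }
}
```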
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852535 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -147,9 +152,9 @@ protected Statistics estimateStatistics(Snapshot snapshot) { @Override public String description() { -String filters = - filterExpressions.stream().map(Spark3Util::describe).collect(Collectors.joining(", ")); -return String.format("%s [filters=%s]", table, filters); +return String.format( +"%s [filters=%s, groupedBy=%s]", +table(), Spark3Util.describe(filterExpressions), groupingKeyType()); Review Comment: Switched to column names only.
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852815 ## core/src/main/java/org/apache/iceberg/Partitioning.java: ## @@ -200,6 +200,16 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) { /** * Builds a grouping key type considering all provided specs. * + * @param specs one or many specs + * @return the constructed grouping key type + */ + public static StructType groupingKeyType(Collection specs) { Review Comment: Deprecated.
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852878 ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -554,6 +554,10 @@ public List fields() { return lazyFieldList(); } +public boolean containsField(int id) { Review Comment: Rewrote the other part and removed this.
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852903 ## spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestCopyOnWriteUpdate.java: ## @@ -140,4 +144,35 @@ public synchronized void testUpdateWithConcurrentTableRefresh() throws Exception executorService.shutdown(); Assert.assertTrue("Timeout", executorService.awaitTermination(2, TimeUnit.MINUTES)); } + + @Test + public void testRuntimeFilteringWithReportedPartitioning() { +createAndInitTable("id INT, dep STRING"); +sql("ALTER TABLE %s ADD PARTITION FIELD dep", tableName); + +append(tableName, "{ \"id\": 1, \"dep\": \"hr\" }\n" + "{ \"id\": 3, \"dep\": \"hr\" }"); +append( +tableName, +"{ \"id\": 1, \"dep\": \"hardware\" }\n" + "{ \"id\": 2, \"dep\": \"hardware\" }"); + +Map sqlConf = +ImmutableMap.of( +SQLConf.V2_BUCKETING_ENABLED().key(), +"true", +SparkSQLProperties.PRESERVE_DATA_GROUPING, +"true"); + +withSQLConf(sqlConf, () -> sql("UPDATE %s SET id = cast('-1' AS INT) WHERE id = 2", tableName)); Review Comment: Removed.
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852915 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java: ## @@ -42,4 +42,9 @@ private SparkSQLProperties() {} // Controls whether to check the order of fields during writes public static final String CHECK_ORDERING = "spark.sql.iceberg.check-ordering"; public static final boolean CHECK_ORDERING_DEFAULT = true; + + // Controls whether to preserve the existing grouping of data while planning splits + public static final String PRESERVE_DATA_GROUPING = + "spark.sql.iceberg.split.preserve-data-grouping"; Review Comment: Switched to `planning`.
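For anyone following along, enabling the behavior controlled by this property also requires a Spark-side flag. The sketch below reflects the rename to `planning` discussed in this thread; the exact Iceberg key should be verified against the release in use.

```java
import org.apache.spark.sql.SparkSession;

public class EnableStoragePartitionedJoins {
  // Sketch: turn on V2 bucketing in Spark and ask Iceberg to preserve the existing
  // data grouping during split planning. Key names are based on this discussion.
  public static void enable(SparkSession spark) {
    spark.conf().set("spark.sql.sources.v2.bucketing.enabled", "true");
    spark.conf().set("spark.sql.iceberg.planning.preserve-data-grouping", "true");
  }
}
```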
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852950 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.source; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.Collections; +import java.util.List; +import java.util.Set; +import org.apache.iceberg.BaseScanTaskGroup; +import org.apache.iceberg.PartitionField; +import org.apache.iceberg.PartitionScanTask; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Scan; +import org.apache.iceberg.ScanTask; +import org.apache.iceberg.ScanTaskGroup; +import org.apache.iceberg.Schema; +import org.apache.iceberg.StructLike; +import org.apache.iceberg.Table; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.io.CloseableIterable; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Sets; +import org.apache.iceberg.spark.Spark3Util; +import org.apache.iceberg.spark.SparkReadConf; +import org.apache.iceberg.types.Types.StructType; +import org.apache.iceberg.util.StructLikeSet; +import org.apache.iceberg.util.TableScanUtil; +import org.apache.spark.sql.SparkSession; +import org.apache.spark.sql.connector.expressions.Transform; +import org.apache.spark.sql.connector.read.SupportsReportPartitioning; +import org.apache.spark.sql.connector.read.partitioning.KeyGroupedPartitioning; +import org.apache.spark.sql.connector.read.partitioning.Partitioning; +import org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +abstract class SparkPartitioningAwareScan extends SparkScan +implements SupportsReportPartitioning { + + private static final Logger LOG = LoggerFactory.getLogger(SparkPartitioningAwareScan.class); + + private final Scan> scan; + private final boolean preserveDataGrouping; + + private Set specs = null; // lazy cache of scanned specs + private List tasks = null; // lazy cache of uncombined tasks + private List> taskGroups = null; // lazy cache of task groups + private StructType groupingKeyType = null; // lazy cache of the grouping key type + private Transform[] groupingKeyTransforms = null; // lazy cache of grouping key transforms + private StructLikeSet groupingKeys = null; // lazy cache of grouping keys + + SparkPartitioningAwareScan( + SparkSession spark, + Table table, + Scan> scan, + SparkReadConf readConf, + Schema expectedSchema, + List filters) { + +super(spark, table, readConf, 
expectedSchema, filters); + +this.scan = scan; +this.preserveDataGrouping = readConf.preserveDataGrouping(); + +if (scan == null) { + this.specs = Collections.emptySet(); + this.tasks = Collections.emptyList(); + this.taskGroups = Collections.emptyList(); +} + } + + protected abstract Class taskJavaClass(); + + protected Scan> scan() { +return scan; + } + + @Override + public Partitioning outputPartitioning() { +if (groupingKeyType().fields().isEmpty()) { + LOG.info("Reporting UnknownPartitioning with {} partition(s)", taskGroups().size()); + return new UnknownPartitioning(taskGroups().size()); +} else { + LOG.info( + "Reporting KeyGroupedPartitioning by {} with {} partition(s)", + groupingKeyTransforms(), + taskGroups().size()); + return new KeyGroupedPartitioning(groupingKeyTransforms(), taskGroups().size()); +} + } + + @Override + protected StructType groupingKeyType() { +if (groupingKeyType == null) { + if (preserveDataGrouping) { +this.groupingKeyType = computeGroupingKeyType(); + } else { +this.groupingKeyType = StructType.of(); + } +} + +return groupingKeyType; + } + + private StructType computeGroupingKeyType() { +return org.apache.iceberg.Partitioning.groupingKeyType(expectedSchema(), specs(
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056862209 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java: (same diff context as the previous comment)
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on code in PR #6371: URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056862319 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java: (same diff context as the previous comment)
[GitHub] [iceberg] aokolnychyi merged pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi merged PR #6371: URL: https://github.com/apache/iceberg/pull/6371
[GitHub] [iceberg] aokolnychyi commented on pull request #6371: Spark 3.3: Support storage-partitioned joins
aokolnychyi commented on PR #6371: URL: https://github.com/apache/iceberg/pull/6371#issuecomment-1364570187 Thanks for reviewing, @RussellSpitzer @sunchao @zinking @rdblue!
[GitHub] [iceberg] aokolnychyi commented on issue #430: Support bucket table for Iceberg
aokolnychyi commented on issue #430: URL: https://github.com/apache/iceberg/issues/430#issuecomment-1364570547 I am excited to announce that support for storage-partitioned joins has been merged into master. It will be shipped in 1.2.0. Thanks to everyone involved, especially @sunchao. I am going to resolve this issue. See PR #6371.
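For those looking to try the feature, here is a rough end-to-end sketch. Table names and the bucket count are illustrative, and it assumes the session configuration shown earlier in this digest (V2 bucketing plus Iceberg's preserve-data-grouping option) has been enabled.

```java
import org.apache.spark.sql.SparkSession;

public class StoragePartitionedJoinSketch {
  // Two Iceberg tables bucketed the same way on the join key; with the grouping
  // configuration enabled, Spark can plan this join without shuffling either side.
  public static void run(SparkSession spark) {
    spark.sql("CREATE TABLE db.orders (id BIGINT, amount DOUBLE) USING iceberg "
        + "PARTITIONED BY (bucket(16, id))");
    spark.sql("CREATE TABLE db.customers (id BIGINT, name STRING) USING iceberg "
        + "PARTITIONED BY (bucket(16, id))");

    spark.sql("SELECT o.id, c.name, o.amount "
        + "FROM db.orders o JOIN db.customers c ON o.id = c.id").show();
  }
}
```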
[GitHub] [iceberg] aokolnychyi closed issue #430: Support bucket table for Iceberg
aokolnychyi closed issue #430: Support bucket table for Iceberg URL: https://github.com/apache/iceberg/issues/430
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056874376 ## python/pyiceberg/io/pyarrow.py: ## @@ -437,3 +465,198 @@ def visit_or(self, left_result: pc.Expression, right_result: pc.Expression) -> p def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression: return boolean_expression_visit(expr, _ConvertToArrowExpression()) + + +def project_table( +files: Iterable[FileScanTask], table: Table, row_filter: BooleanExpression, projected_schema: Schema, case_sensitive: bool +) -> pa.Table: +"""Resolves the right columns based on the identifier + +Args: +files(Iterable[FileScanTask]): A URI or a path to a local file +table(Table): The table that's being queried +row_filter(BooleanExpression): The expression for filtering rows +projected_schema(Schema): The output schema +case_sensitive(bool): Case sensitivity when looking up column names + +Raises: +ResolveException: When an incompatible query is done +""" + +if isinstance(table.io, PyArrowFileIO): +scheme, path = PyArrowFileIO.parse_location(table.location()) +fs = table.io.get_fs(scheme) +else: +raise ValueError(f"Expected PyArrowFileIO, got: {table.io}") + +bound_row_filter = bind(table.schema(), row_filter, case_sensitive=case_sensitive) + +projected_field_ids = { +id for id in projected_schema.field_ids if not isinstance(projected_schema.find_type(id), (MapType, ListType)) +}.union(extract_field_ids(bound_row_filter)) + +tables = [] +for task in files: +_, path = PyArrowFileIO.parse_location(task.file.file_path) + +# Get the schema +with fs.open_input_file(path) as fout: +parquet_schema = pq.read_schema(fout) +schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA) +file_schema = Schema.parse_raw(schema_raw) + +pyarrow_filter = None +if row_filter is not AlwaysTrue(): +translated_row_filter = translate_column_names(bound_row_filter, file_schema, case_sensitive=case_sensitive) +bound_row_filter = bind(file_schema, translated_row_filter, case_sensitive=case_sensitive) +pyarrow_filter = expression_to_pyarrow(bound_row_filter) + +file_project_schema = prune_columns(file_schema, projected_field_ids, select_full_types=False) + +if file_schema is None: +raise ValueError(f"Missing Iceberg schema in Metadata for file: {path}") + +# Prune the stuff that we don't need anyway +file_project_schema_arrow = schema_to_pyarrow(file_project_schema) + +arrow_table = ds.dataset( +source=[path], schema=file_project_schema_arrow, format=ds.ParquetFileFormat(), filesystem=fs +).to_table(filter=pyarrow_filter) + +tables.append(to_requested_schema(projected_schema, file_project_schema, arrow_table)) + +if len(tables) > 1: +return pa.concat_tables(tables) +else: +return tables[0] + + +def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table: +return VisitWithArrow(requested_schema, file_schema, table).visit() + + +class VisitWithArrow: +requested_schema: Schema +file_schema: Schema +table: pa.Table + +def __init__(self, requested_schema: Schema, file_schema: Schema, table: pa.Table) -> None: +self.requested_schema = requested_schema +self.file_schema = file_schema +self.table = table + +def visit(self) -> pa.Table: +return self.visit_with_arrow(self.requested_schema, self.file_schema) + +@singledispatchmethod +def visit_with_arrow(self, requested_schema: Union[Schema, IcebergType], file_schema: Union[Schema, IcebergType]) -> pa.Table: +"""A generic function for applying a schema visitor to any point within a schema + +The function traverses the schema in 
post-order fashion + +Args: +obj(Schema | IcebergType): An instance of a Schema or an IcebergType +visitor (VisitWithArrow[T]): An instance of an implementation of the generic VisitWithArrow base class + +Raises: +NotImplementedError: If attempting to visit an unrecognized object type +""" +raise NotImplementedError(f"Cannot visit non-type: {requested_schema}") + +@visit_with_arrow.register(Schema) +def _(self, requested_schema: Schema, file_schema: Schema) -> pa.Table: +"""Visit a Schema with a concrete SchemaVisitorWithPartner""" +struct_result = self.visit_with_arrow(requested_schema.as_struct(), file_schema.as_struct()) +pyarrow_schema = schema_to_pyarrow(requested_schema) +return pa.Table.from_arrays(struct_result.flatten(), schema=pyarrow_schema) + +def _get_field_by_id(self, field_id: int) -> Optional[NestedField]: +try: +return sel
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056874464 ## python/pyiceberg/io/pyarrow.py: (same diff context as the previous comment)
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056874748 ## python/pyiceberg/io/pyarrow.py: ## @@ -437,3 +465,198 @@ def visit_or(self, left_result: pc.Expression, right_result: pc.Expression) -> p def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression: return boolean_expression_visit(expr, _ConvertToArrowExpression()) + + +def project_table( +files: Iterable[FileScanTask], table: Table, row_filter: BooleanExpression, projected_schema: Schema, case_sensitive: bool +) -> pa.Table: +"""Resolves the right columns based on the identifier + +Args: +files(Iterable[FileScanTask]): A URI or a path to a local file +table(Table): The table that's being queried +row_filter(BooleanExpression): The expression for filtering rows +projected_schema(Schema): The output schema +case_sensitive(bool): Case sensitivity when looking up column names + +Raises: +ResolveException: When an incompatible query is done +""" + +if isinstance(table.io, PyArrowFileIO): +scheme, path = PyArrowFileIO.parse_location(table.location()) +fs = table.io.get_fs(scheme) +else: +raise ValueError(f"Expected PyArrowFileIO, got: {table.io}") + +bound_row_filter = bind(table.schema(), row_filter, case_sensitive=case_sensitive) + +projected_field_ids = { +id for id in projected_schema.field_ids if not isinstance(projected_schema.find_type(id), (MapType, ListType)) Review Comment: This is because we can't select the MapType/ListType: ``` if not field.field_type.is_primitive: > raise ValueError( f"Cannot explicitly project List or Map types, {field.field_id}:{field.name} of type {field.field_type} was selected" ) E ValueError: Cannot explicitly project List or Map types, 5:ids of type list was selected ``` We're only interested in the element in case of the list, and the key-value in case of a map. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875584 ## python/pyiceberg/schema.py: ## @@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType: value_type=value_result, value_required=map_type.value_required, ) + + +@singledispatch +def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType: +"""Promotes reading a file type to a read type + +Args: +file_type (IcebergType): The type of the Avro file +read_type (IcebergType): The requested read type + +Raises: +ResolveException: If attempting to resolve an unrecognized object type +""" +raise ResolveException(f"Cannot promote {file_type} to {read_type}") + + +@promote.register(IntegerType) +def _(file_type: IntegerType, read_type: IcebergType) -> IcebergType: +if isinstance(read_type, LongType): Review Comment: The promote function is only called when the read type and the file type are different.
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875610 ## python/pyiceberg/schema.py: ## @@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType: value_type=value_result, value_required=map_type.value_required, ) + + +@singledispatch +def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType: +"""Promotes reading a file type to a read type + +Args: +file_type (IcebergType): The type of the Avro file +read_type (IcebergType): The requested read type + +Raises: +ResolveException: If attempting to resolve an unrecognized object type +""" +raise ResolveException(f"Cannot promote {file_type} to {read_type}") Review Comment: Works for me, updated 👍🏻
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875777 ## python/pyiceberg/exceptions.py: ## @@ -86,3 +86,7 @@ class NotInstalledError(Exception): class SignError(Exception): """Raises when unable to sign a S3 request""" + + +class ResolveException(Exception): Review Comment: Good catch, updated 👍🏻
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875820 ## python/pyiceberg/expressions/visitors.py: ## @@ -753,3 +756,89 @@ def inclusive_projection( schema: Schema, spec: PartitionSpec, case_sensitive: bool = True ) -> Callable[[BooleanExpression], BooleanExpression]: return InclusiveProjection(schema, spec, case_sensitive).project + + +class _ColumnNameTranslator(BooleanExpressionVisitor[BooleanExpression]): +"""Converts the column names with the ones in the actual file + +Args: + file_schema (Schema): The schema of the file + case_sensitive (bool): Whether to consider case when binding a reference to a field in a schema, defaults to True + +Raises: +TypeError: In the case of an UnboundPredicate +ValueError: When a column name cannot be found +""" + +file_schema: Schema +case_sensitive: bool + +def __init__(self, file_schema: Schema, case_sensitive: bool) -> None: +self.file_schema = file_schema +self.case_sensitive = case_sensitive + +def visit_true(self) -> BooleanExpression: +return AlwaysTrue() + +def visit_false(self) -> BooleanExpression: +return AlwaysFalse() + +def visit_not(self, child_result: BooleanExpression) -> BooleanExpression: +return Not(child=child_result) + +def visit_and(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression: +return And(left=left_result, right=right_result) + +def visit_or(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression: +return Or(left=left_result, right=right_result) + +def visit_unbound_predicate(self, predicate: UnboundPredicate[L]) -> BooleanExpression: +raise TypeError(f"Expected Bound Predicate, got: {predicate.term}") + +def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> BooleanExpression: +file_column_name = self.file_schema.find_column_name(predicate.term.ref().field.field_id) + +if not file_column_name: +raise ValueError(f"Not found in schema: {file_column_name}") Review Comment: Great suggestion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875859 ## python/pyiceberg/expressions/visitors.py: ## @@ -753,3 +756,89 @@ def inclusive_projection( schema: Schema, spec: PartitionSpec, case_sensitive: bool = True ) -> Callable[[BooleanExpression], BooleanExpression]: return InclusiveProjection(schema, spec, case_sensitive).project + + +class _ColumnNameTranslator(BooleanExpressionVisitor[BooleanExpression]): +"""Converts the column names with the ones in the actual file + +Args: + file_schema (Schema): The schema of the file + case_sensitive (bool): Whether to consider case when binding a reference to a field in a schema, defaults to True + +Raises: +TypeError: In the case of an UnboundPredicate +ValueError: When a column name cannot be found +""" + +file_schema: Schema +case_sensitive: bool + +def __init__(self, file_schema: Schema, case_sensitive: bool) -> None: +self.file_schema = file_schema +self.case_sensitive = case_sensitive + +def visit_true(self) -> BooleanExpression: +return AlwaysTrue() + +def visit_false(self) -> BooleanExpression: +return AlwaysFalse() + +def visit_not(self, child_result: BooleanExpression) -> BooleanExpression: +return Not(child=child_result) + +def visit_and(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression: +return And(left=left_result, right=right_result) + +def visit_or(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression: +return Or(left=left_result, right=right_result) + +def visit_unbound_predicate(self, predicate: UnboundPredicate[L]) -> BooleanExpression: +raise TypeError(f"Expected Bound Predicate, got: {predicate.term}") + +def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> BooleanExpression: +file_column_name = self.file_schema.find_column_name(predicate.term.ref().field.field_id) + +if not file_column_name: +raise ValueError(f"Not found in schema: {file_column_name}") + +if isinstance(predicate, BoundUnaryPredicate): +return predicate.as_unbound(file_column_name) +elif isinstance(predicate, BoundLiteralPredicate): +return predicate.as_unbound(file_column_name, predicate.literal) +elif isinstance(predicate, BoundSetPredicate): +return predicate.as_unbound(file_column_name, predicate.literals) +else: +raise ValueError(f"Unknown predicate: {predicate}") Review Comment: Good call -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056876255 ## python/pyiceberg/types.py: ## @@ -268,6 +268,10 @@ def __init__(self, *fields: NestedField, **data: Any): data["fields"] = fields super().__init__(**data) +def by_id(self) -> Dict[int, NestedField]: Review Comment: Good call, updated!
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056876528 ## python/tests/avro/test_resolver.py: ## @@ -164,17 +163,17 @@ def test_resolver_change_type() -> None: def test_promote_int_to_long() -> None: -assert promote(IntegerType(), LongType()) == IntegerReader() +assert promote(IntegerType(), LongType()) == LongType() Review Comment: For those tests (and since it is in Avro), we want to use `resolve`. The difference is that with `resolve` we check whether the promotion is valid, but if we're going from float to double, we still want to read the float (since a double is twice as many bytes), so we'll return a float.
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056877937 ## python/pyiceberg/schema.py: ## @@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType: value_type=value_result, value_required=map_type.value_required, ) + + +@singledispatch +def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType: +"""Promotes reading a file type to a read type + +Args: +file_type (IcebergType): The type of the Avro file +read_type (IcebergType): The requested read type + +Raises: +ResolveException: If attempting to resolve an unrecognized object type +""" +raise ResolveException(f"Cannot promote {file_type} to {read_type}") + + +@promote.register(IntegerType) +def _(file_type: IntegerType, read_type: IcebergType) -> IcebergType: +if isinstance(read_type, LongType): +# Ints/Longs are binary compatible in Avro, so this is okay +return read_type +else: +raise ResolveException(f"Cannot promote an int to {read_type}") + + +@promote.register(FloatType) +def _(file_type: FloatType, read_type: IcebergType) -> IcebergType: +if isinstance(read_type, DoubleType): +# A double type is wider +return read_type +else: +raise ResolveException(f"Cannot promote an float to {read_type}") + + +@promote.register(StringType) +def _(file_type: StringType, read_type: IcebergType) -> IcebergType: +if isinstance(read_type, BinaryType): +return read_type +else: +raise ResolveException(f"Cannot promote an string to {read_type}") + + +@promote.register(BinaryType) +def _(file_type: BinaryType, read_type: IcebergType) -> IcebergType: +if isinstance(read_type, StringType): +return read_type +else: +raise ResolveException(f"Cannot promote an binary to {read_type}") + + +@promote.register(DecimalType) +def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType: +if isinstance(read_type, DecimalType): +if file_type.precision <= read_type.precision and file_type.scale == file_type.scale: +return read_type +else: +raise ResolveException(f"Cannot reduce precision from {file_type} to {read_type}") +else: +raise ResolveException(f"Cannot promote an decimal to {read_type}") + + +@promote.register(StructType) +def _(file_type: StructType, read_type: IcebergType) -> IcebergType: Review Comment: Yes, we need this because we call it from the `field` method in the visitor, and a field also can have a StructType. We don't have enough information to call this on the primitive method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID
Fokko commented on code in PR #6437: URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056878066
## python/pyiceberg/schema.py:
## @@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType:
         value_type=value_result,
         value_required=map_type.value_required,
     )
+
+
+@singledispatch
+def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType:
+    """Promotes reading a file type to a read type
+
+    Args:
+        file_type (IcebergType): The type of the Avro file
+        read_type (IcebergType): The requested read type
+
+    Raises:
+        ResolveException: If attempting to resolve an unrecognized object type
+    """
+    raise ResolveException(f"Cannot promote {file_type} to {read_type}")
+
+
+@promote.register(IntegerType)
+def _(file_type: IntegerType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, LongType):
+        # Ints/Longs are binary compatible in Avro, so this is okay
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote an int to {read_type}")
+
+
+@promote.register(FloatType)
+def _(file_type: FloatType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, DoubleType):
+        # A double type is wider
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a float to {read_type}")
+
+
+@promote.register(StringType)
+def _(file_type: StringType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, BinaryType):
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a string to {read_type}")
+
+
+@promote.register(BinaryType)
+def _(file_type: BinaryType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, StringType):
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a binary to {read_type}")
+
+
+@promote.register(DecimalType)
+def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, DecimalType):
+        if file_type.precision <= read_type.precision and file_type.scale == read_type.scale:
+            return read_type
+        else:
+            raise ResolveException(f"Cannot reduce precision from {file_type} to {read_type}")
+    else:
+        raise ResolveException(f"Cannot promote a decimal to {read_type}")
+
+
+@promote.register(StructType)
+def _(file_type: StructType, read_type: IcebergType) -> IcebergType:
Review Comment: I would say we can rewrite:
```python
def struct(
    self, struct: StructType, struct_array: Optional[pa.Array], field_results: List[Optional[pa.Array]]
) -> Optional[pa.Array]:
    if struct_array is None:
        return None
    field_arrays: List[pa.Array] = []
    fields: List[pa.Field] = []
    for pos, field_array in enumerate(field_results):
        field = struct.fields[pos]
        if field_array is not None:
            array = self.cast_if_needed(field, field_array)
            field_arrays.append(array)
            fields.append(pa.field(field.name, array.type, field.optional))
        elif field.optional:
            arrow_type = schema_to_pyarrow(field.field_type)
            field_arrays.append(pa.nulls(len(struct_array)).cast(arrow_type))
            fields.append(pa.field(field.name, arrow_type, field.optional))
        else:
            raise ResolveException(f"Field is required, and could not be found in the file: {field}")

    return pa.StructArray.from_arrays(arrays=field_arrays, fields=pa.struct(fields))

def field(self, field: NestedField, _: Optional[pa.Array], field_array: Optional[pa.Array]) -> Optional[pa.Array]:
    return field_array
```
I think this is cleaner because we just handle the field in the `field` method:
```python
def struct(
    self, struct: StructType, struct_array: Optional[pa.Array], field_results: List[Optional[pa.Array]]
) -> Optional[pa.Array]:
    if struct_array is None:
        return None
    return pa.StructArray.from_arrays(arrays=field_results, fields=pa.struct(schema_to_pyarrow(struct)))

def field(self, field: NestedField, _: Optional[pa.Array], field_array: Optional[pa.Array]) -> Optional[pa.Array]:
    if field_array is not None:
        return self.cast_if_needed(field, field_array)
    elif field.optional:
        arrow_type = schema_to_pyarrow(field.field_type)
        return pa.nulls(3, type=arrow_type)  # We need to find the length somehow
    else:
        raise ResolveException(f"Field is required, and could not be found in the file: {field}")
```
But then the field_type can still be a StructType.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
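For readers unfamiliar with the PyArrow calls used in the sketches above, here is a small self-contained example (the column names are made up for illustration) of padding a missing optional column with typed nulls and assembling a `StructArray`:

```python
import pyarrow as pa

ids = pa.array([1, 2, 3], type=pa.int64())
# A column that is absent from the data file is represented as typed nulls,
# matching what the struct/field visitors above do for missing optional fields.
names = pa.nulls(3, type=pa.string())

people = pa.StructArray.from_arrays(
    arrays=[ids, names],
    fields=[
        pa.field("id", pa.int64(), nullable=False),
        pa.field("name", pa.string(), nullable=True),
    ],
)
print(people)  # struct<id: int64, name: string> with all names null
```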
[GitHub] [iceberg] github-actions[bot] commented on issue #5139: Historical time travel imports
github-actions[bot] commented on issue #5139: URL: https://github.com/apache/iceberg/issues/5139#issuecomment-1364598871 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] commented on issue #5141: No way to rollback first commit in table
github-actions[bot] commented on issue #5141: URL: https://github.com/apache/iceberg/issues/5141#issuecomment-1364598864 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #6449: WIP: Delta, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table
JonasJ-ap commented on code in PR #6449: URL: https://github.com/apache/iceberg/pull/6449#discussion_r1056753664
## data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
## @@ -161,7 +161,7 @@ private static Metrics getAvroMetrics(Path path, Configuration conf) {
     }
   }
-  private static Metrics getParquetMetrics(
+  public static Metrics getParquetMetrics(
Review Comment: Thank you for your suggestion. It seems doing so would mean copy-pasting `getParquetMetrics` from `TableMigrationUtil` into `BaseMigrateDeltaLakeTableAction`. I think the trade-off here is between duplicating a code section and exposing private methods. Given that `TableMigrationUtil` is intended to provide util methods for table migration, do you think it would be appropriate to make these methods public in this case?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6491: Build: Bump actions/stale from 6.0.1 to 7.0.0
dependabot[bot] opened a new pull request, #6491: URL: https://github.com/apache/iceberg/pull/6491
Bumps [actions/stale](https://github.com/actions/stale) from 6.0.1 to 7.0.0.

Release notes (sourced from actions/stale's releases):
v7.0.0 -- this version contains breaking changes.
What's Changed:
- Allow daysBeforeStale options to be float by @irega in actions/stale#841
- Use cache in check-dist.yml by @jongwooo in actions/stale#876
- fix print outputs step in existing workflows by @irega in actions/stale#859
- Update issue and PR templates, add/delete workflow files by @IvanZosimov in actions/stale#880
- Update how stale handles exempt items by @johnsudol in actions/stale#874
Breaking changes: in this release the action no longer manages the stale label on items included in exempt-issue-labels and exempt-pr-labels; this was decided to be outside the scope of the action and is left up to the maintainer.
New contributors: @irega (#841), @jongwooo (#876), @IvanZosimov (#880), @johnsudol (#874).
Full changelog: https://github.com/actions/stale/compare/v6...v7.0.0

Changelog (sourced from actions/stale's CHANGELOG.md):
[7.0.0] :warning: Breaking change :warning: -- same items as the release notes above.

Commits:
- 6f05e42 draft release for v7.0.0 (#888)
- eed91cb Update how stale handles exempt items (#874)
- 10dc265 Merge pull request #880 from akv-platform/update-stale-repo
- 9c1eb3f Update .md files and allign build-test.yml with the current test.yml
- bc357bd Update .github/workflows/release-new-action-version.yml
- 690ede5 Update .github/ISSUE_TEMPLATE/bug_report.md
- afbcabf Merge branch 'main' into update-stale-repo
- e364411 Update name of codeql.yml file
- 627cef3 fix print outputs step (#859)
- 975308f ... (truncated)
[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6492: Build: Bump adlfs from 2022.10.0 to 2022.11.2 in /python
dependabot[bot] opened a new pull request, #6492: URL: https://github.com/apache/iceberg/pull/6492
Bumps [adlfs](https://github.com/dask/adlfs) from 2022.10.0 to 2022.11.2.

Changelog (sourced from adlfs's CHANGELOG.md):
2022.11.2
- Reorder fs.info() to search the parent directory only after searching for the specified item directly
- Removed pin on upper bound for azure-storage-blob
- Moved AzureDatalakeFileSystem to a separate module, in acknowledgement of the Microsoft Azure Data Lake Storage Gen1 retirement announcement
- Added DeprecationWarning to AzureDatalakeFilesystem
2022.10.1
- Pin azure-storage-blob >=12.12.0,=1.23.1,<2.0.0
2022.9.1
- Fixed missing dependency on aiohttp in package metadata
2022.9.0
- Add support to AzureBlobFileSystem for versioning
- Assure full uri's are left stripped to remove ""
- Set Python requires >=3.8 in setup.cfg
- _strip_protocol handle lists
2022.7.0
- Fix overflow error when uploading files > 2GB
2022.04.0
- Added support for Python 3.10 and pinned Python 3.8
- Skip test_url due to bug in Azurite
- Added isort and update pre-commit
v2022.02.0
- Updated requirements to fsspec >= 2021.10.1 to fix #280
- Fixed deprecation warning in pytest_asyncio by setting asycio_mode = True
v2021.10.1
- Added support for Hierarchical Namespaces in Gen2 to enable multilevel Hive partitioned tables in pyarrow
- Registered abfss:// as an entrypoint instead of registering at runtime
- Implemented support for fsspec callbacks in put_file and get_file
v2021.09.1
- Fixed isdir() bug causing some directories to be labeled incorrectly as files
- Added flexible url handling to improve compatibility with other applications using Spark and fsspec
v2021.08.1 ... (truncated)

Commits:
- b01b30d Added deprecationwarning to azuredatalakefilesystem and removed it from extra... (truncated)
- 0d47f38 Patch1 (#372)
- 6b4f3d8 _ls: support detail (#369)
- 5cd75d0 Improve credential how-to description in README (#367)
- f9ac1a8 Fix info perf (#370)
- f15c37a Updated instructions for CONTRIBUTING.md to use colima (#355)
- 6aa23f7 Fix info() performance (#360)
- 6a3530f Respect onerror='return' in AzureBlobFileSystem.cat (#359)
- 065b71b Update changelog.md (#356)
See full diff in the compare view: https://github.com/dask/adlfs/compare/2022.10.0...2022.11.2

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options -- you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` w... (truncated)
[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6493: Build: Bump coverage from 6.5.0 to 7.0.1 in /python
dependabot[bot] opened a new pull request, #6493: URL: https://github.com/apache/iceberg/pull/6493
Bumps [coverage](https://github.com/nedbat/coveragepy) from 6.5.0 to 7.0.1.

Changelog (sourced from coverage's CHANGES.rst):
Version 7.0.1 -- 2022-12-23
- When checking if a file mapping resolved to a file that exists, we weren't considering files in .whl files. This is now fixed, closing issue 1511.
- File pattern rules were too strict, forbidding plus signs and curly braces in directory and file names. This is now fixed, closing issue 1513.
- Unusual Unicode or control characters in source files could prevent reporting. This is now fixed, closing issue 1512.
- The PyPy wheel now installs on PyPy 3.7, 3.8, and 3.9, closing issue 1510.
Version 7.0.0 -- 2022-12-18
- Nothing new beyond 7.0.0b1.
Version 7.0.0b1 -- 2022-12-03
- A number of changes have been made to file path handling, including pattern matching and path remapping with the [paths] setting (see :ref:config_paths). These changes might affect you, and require you to update your settings. (This release includes the changes from 6.6.0b1, since 6.6.0 was never released.)
- Changes to file pattern matching, which might require updating your configuration: previously, * would incorrectly match directory separators, making precise matching difficult (this is now fixed, closing issue 1407); now ** matches any number of nested directories, including none.
- Improvements to combining data files when using the ... (truncated)

Commits:
- c5cda3a docs: releases take a little bit longer now
- 9d4226e docs: latest sample HTML report
- 8c77758 docs: prep for 7.0.1
- da1b282 fix: also look into .whl files for source
- d327a70 fix: more information when mapping rules aren't working right
- 35e249f fix: certain strange characters caused reporting to fail (#1512)
- 152cdc7 fix: don't forbid plus signs in file names (#1513)
- 31513b4 chore: make upgrade
- 873b059 test: don't run tests on Windows PyPy-3.9
- 5c5caa2 build: PyPy wheel now installs on 3.7, 3.8, and 3.9 (#1510)
Additional commits viewable in the compare view: https://github.com/nedbat/coveragepy/compare/6.5.0...7.0.1

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options -- you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot sq... (truncated)
[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6494: Build: Bump moto from 4.0.11 to 4.0.12 in /python
dependabot[bot] opened a new pull request, #6494: URL: https://github.com/apache/iceberg/pull/6494
Bumps [moto](https://github.com/spulec/moto) from 4.0.11 to 4.0.12.

Changelog (sourced from moto's CHANGELOG.md):
4.0.12
Docker Digest for 4.0.12: sha256:06916d3f310c68fd445468f06d6d4ae6f855e7f2b80e007a90bd11eeb421b5ed
General:
- Fixes our Kinesis-compatibility with botocore>=1.29.31 -- earlier Moto-versions will connect to AWS when using this botocore-version
New Methods:
- Athena: get_query_results(), list_query_executions()
- RDS: promote_read_replica()
- Sagemaker: create_pipeline(), delete_pipeline(), list_pipelines()
Miscellaneous:
- AWSLambda: publish_function() and update_function_code() now only increment the version if the source code has changed
- CognitoIDP: Passwords are now validated using the PasswordPolicy (either supplied, or the default)
- CloudFormation: create_stack() now propagates parameters StackPolicyBody and TimeoutInMinutes
- CloudFormation: create_stack_instances() now returns the actual OperationId
- CloudFormation: create_stack_set() now validates the provided name
- CloudFormation: create_stack_set() now supports the DeploymentTargets-parameter
- CloudFormation: create_stack_set() now actually creates the provided resources
- CloudFormation: create_stack_set() now propagates parameters AdministrationRoleARN and ExecutionRoleName
- CloudFormation: describe_stack_set() now returns the attributes Description, PermissionModel
- CloudFormation: delete_stack_set() now validates that no instances are present before deleting the set
- CloudWatch: get_metric_data() now supports the Label-parameter
- EC2: allocate_address() now has improved behaviour for the Domain-parameter
- EC2: create_volume() now supports the Iops-parameter
- ECR: Improved ImageManifest support
- KMS: describe_key() now returns an AccessDeniedException if the supplied policy does not allow this action
- Route53: change_resource_record_sets() has additional validations
- Route53: create_hosted_zone() now also creates a SOA-record by default
- S3: put_object() now returns the ChecksumAlgorithm-attribute if supplied
- SSM: describe_parameters() now has improved support for filtering by tags

Commits:
- 626803a Prepare release 4.0.12 (#5781)
- 42d8216 Techdebt: KMS: Mock RSA calls in some tests (#5782)
- f67abbe Add sagemaker mock call: delete_pipeline (#5780)
- 137f06b KMS: Basic key policy enforcement (#5777)
- ee6e8dd Fix some Batch error message typos (#5779)
- 52891e1 EC2: Add iops to volume (#5776)
- e5d40f6 SageMaker: create_pipeline, list_pipelines (#5771)
- 2cf770f ECR Manifest List Support (#5753)
- 16f9ff5 Kinesis: Support new endpoint for botocore 1.29.31 (#5778)
- 07a8d6f Add Athena: get_query_results, list_query_executions (#5648)
Additional commits viewable in the compare view: https://github.com/spulec/moto/compare/4.0.11...4.0.12

Dependabot will resolve any conflicts with this PR as long as you... (truncated)
[GitHub] [iceberg] nickvazz commented on issue #6061: [Python] Add examples
nickvazz commented on issue #6061: URL: https://github.com/apache/iceberg/issues/6061#issuecomment-1364628562 Would love to see a minimal getting started / setup -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
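For reference, here is the kind of snippet such a getting-started section could open with. This is only a sketch: the catalog name, URI, and table identifier are placeholders rather than anything from the project docs.

```python
# Minimal PyIceberg "hello world": connect to a catalog and inspect a table.
# The properties below assume a locally running REST catalog (placeholder URI).
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})

print(catalog.list_namespaces())            # e.g. [("examples",)]
table = catalog.load_table("examples.nyc_taxis")
print(table.schema())                       # the table's current Iceberg schema
```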