[GitHub] [iceberg-docs] quantsegu opened a new pull request, #192: Adjusted on comments for adding IOMETE as a vendor

2022-12-24 Thread GitBox


quantsegu opened a new pull request, #192:
URL: https://github.com/apache/iceberg-docs/pull/192

   The small textual adjustments have been made. Could you please finalize this?





[GitHub] [iceberg] hililiwei commented on a diff in pull request #5984: Core, API: Support incremental scanning with branch

2022-12-24 Thread GitBox


hililiwei commented on code in PR #5984:
URL: https://github.com/apache/iceberg/pull/5984#discussion_r1056798443


##
api/src/main/java/org/apache/iceberg/IncrementalScan.java:
##
@@ -21,6 +21,23 @@
 /** API for configuring an incremental scan. */
 public interface IncrementalScan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>>
     extends Scan<ThisT, T, G> {
+
+  /**
+   * Instructs this scan to look for changes starting from a particular snapshot (inclusive).
+   *
+   * <p>If the start snapshot is not configured, it is defaulted to the oldest ancestor of the end
+   * snapshot (inclusive).
+   *
+   * @param fromSnapshotId the start snapshot ID (inclusive)
+   * @param referenceName the ref used
+   * @return this for method chaining
+   * @throws IllegalArgumentException if the start snapshot is not an ancestor of the end snapshot
+   */
+  default ThisT fromSnapshotInclusive(long fromSnapshotId, String referenceName) {

Review Comment:
   Agree with @stevenzwu.
   Yes, a tag is a fixed point in time, but when using it for an incremental read, we can treat it as semantically the same as using `fromSnapshot(Long snapshotId)`.
   Just like @stevenzwu's example: I have daily tags (`20220101`, `20220102`). If I want to read the incremental data from `20220102` up to the current snapshot, I can use `fromSnapshotExclusive("20220102")`:
   ```
   table.newIncrementalScan()
       .fromSnapshotExclusive("20220102")
       .planTasks()
   ```
   Another way is to use the snapshot time to find the snapshot ID first, but sometimes that doesn't work. For example, we may generate tags based on the event time of the data, or we may tag the snapshot only after the application has completed. The application may finish at 3:00 on 2022/01/02 and tag the newly generated snapshot as `20220102`. If we use the snapshot time `2022-01-02 00:00:00` to find the snapshot ID, incorrect incremental data will be returned.
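   A minimal sketch of the `long`-based form this maps to today (assuming the snapshot ID for the `20220102` tag has already been resolved out of band; `readSince` is an illustrative helper, not an Iceberg API):
   ```
   import org.apache.iceberg.FileScanTask;
   import org.apache.iceberg.IncrementalAppendScan;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.io.CloseableIterable;

   class IncrementalReadSketch {
     // Plans everything appended after the tagged snapshot up to the current snapshot.
     static CloseableIterable<FileScanTask> readSince(Table table, long taggedSnapshotId) {
       IncrementalAppendScan scan =
           table.newIncrementalAppendScan().fromSnapshotExclusive(taggedSnapshotId);
       return scan.planFiles();
     }
   }
   ```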
   
   cc @rdblue 






[GitHub] [iceberg] hililiwei commented on pull request #6253: Flink: Write watermark to the snapshot summary

2022-12-24 Thread GitBox


hililiwei commented on PR #6253:
URL: https://github.com/apache/iceberg/pull/6253#issuecomment-1364513120

   > should we write the metadata as a snapshot summary or a table property? It is a minimal change to write it as a snapshot summary, as shown in this PR or PR https://github.com/apache/iceberg/pull/2109. It is a little bigger change (like using a transaction) to write it as a table property, but it would be easier for consumers to extract the info.
   
   I think these are two different aspects with different meanings. I include it in the summary because it does not represent the watermark of the table; it simply represents the watermark of the current Flink job that generated the snapshot.
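   As a hedged illustration of the consumer side (the summary key `flink.watermark` is assumed for illustration, not necessarily the key this PR writes), reading it back is a per-snapshot lookup rather than a table-property lookup:
   ```
   import java.util.Map;
   import org.apache.iceberg.Snapshot;
   import org.apache.iceberg.Table;

   class WatermarkLookupSketch {
     // The value lives in one snapshot's summary, so it describes the job that
     // produced that snapshot, not the table as a whole.
     static String watermarkOf(Table table) {
       Snapshot current = table.currentSnapshot();
       Map<String, String> summary = current.summary();
       return summary.get("flink.watermark"); // assumed key name
     }
   }
   ```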
   





[GitHub] [iceberg] hililiwei commented on pull request #6160: Flink: Support locality with LocalitySplitAssigner

2022-12-24 Thread GitBox


hililiwei commented on PR #6160:
URL: https://github.com/apache/iceberg/pull/6160#issuecomment-1364514213

   @stevenzwu @openinx @rdblue @Fokko @pvary  could you please take a look at 
it when you get a chance? thx.
   
   





[GitHub] [iceberg-docs] RussellSpitzer merged pull request #192: Adjusted on comments for adding IOMETE as a vendor

2022-12-24 Thread GitBox


RussellSpitzer merged PR #192:
URL: https://github.com/apache/iceberg-docs/pull/192





[GitHub] [iceberg] ConeyLiu commented on pull request #3249: Optimized spark vectorized read parquet decimal

2022-12-24 Thread GitBox


ConeyLiu commented on PR #3249:
URL: https://github.com/apache/iceberg/pull/3249#issuecomment-1364527752

   @nastra just addressed the comments.






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852402


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java:
##
@@ -356,4 +357,23 @@ private <T> GenericArrayData fillArray(
 
 return new GenericArrayData(array);
   }
+
+  @Override
+  public boolean equals(Object other) {

Review Comment:
   It is a bug in Spark that will be fixed. Spark uses `groupBy` in one place, 
which relies on `equals`.
   All other places use `InternalRowSet`.
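   For reference, a minimal sketch of a struct-backed `equals`/`hashCode` pair (illustrative only; the `struct` field name is assumed and this is not necessarily the exact code in the PR):
   ```
   @Override
   public boolean equals(Object other) {
     if (this == other) {
       return true;
     } else if (!(other instanceof StructInternalRow)) {
       return false;
     }
     // Delegate to the wrapped Iceberg struct so rows wrapping equal structs
     // compare equal, which the groupBy-based code path in Spark relies on.
     StructInternalRow that = (StructInternalRow) other;
     return struct.equals(that.struct);
   }

   @Override
   public int hashCode() {
     return struct.hashCode();
   }
   ```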






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852535


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##
@@ -147,9 +152,9 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
 
   @Override
   public String description() {
-    String filters =
-        filterExpressions.stream().map(Spark3Util::describe).collect(Collectors.joining(", "));
-    return String.format("%s [filters=%s]", table, filters);
+    return String.format(
+        "%s [filters=%s, groupedBy=%s]",
+        table(), Spark3Util.describe(filterExpressions), groupingKeyType());

Review Comment:
   Switched to column names only.






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852815


##
core/src/main/java/org/apache/iceberg/Partitioning.java:
##
@@ -200,6 +200,16 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   /**
    * Builds a grouping key type considering all provided specs.
    *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   Deprecated.
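   A sketch of the deprecation, assuming the replacement is the schema-aware overload used elsewhere in this PR; the delegation body and javadoc wording are illustrative, not the merged code:
   ```
   /** @deprecated use {@link #groupingKeyType(Schema, Collection)} instead */
   @Deprecated
   public static StructType groupingKeyType(Collection<PartitionSpec> specs) {
     // assumption for this sketch: a null schema means "do not prune by projection"
     return groupingKeyType(null, specs);
   }
   ```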






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852878


##
api/src/main/java/org/apache/iceberg/types/Types.java:
##
@@ -554,6 +554,10 @@ public List<NestedField> fields() {
       return lazyFieldList();
     }
 
+    public boolean containsField(int id) {

Review Comment:
   Rewrote the other part and removed this.






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852903


##
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestCopyOnWriteUpdate.java:
##
@@ -140,4 +144,35 @@ public synchronized void testUpdateWithConcurrentTableRefresh() throws Exception
     executorService.shutdown();
     Assert.assertTrue("Timeout", executorService.awaitTermination(2, TimeUnit.MINUTES));
   }
+
+  @Test
+  public void testRuntimeFilteringWithReportedPartitioning() {
+    createAndInitTable("id INT, dep STRING");
+    sql("ALTER TABLE %s ADD PARTITION FIELD dep", tableName);
+
+    append(tableName, "{ \"id\": 1, \"dep\": \"hr\" }\n" + "{ \"id\": 3, \"dep\": \"hr\" }");
+    append(
+        tableName,
+        "{ \"id\": 1, \"dep\": \"hardware\" }\n" + "{ \"id\": 2, \"dep\": \"hardware\" }");
+
+    Map<String, String> sqlConf =
+        ImmutableMap.of(
+            SQLConf.V2_BUCKETING_ENABLED().key(),
+            "true",
+            SparkSQLProperties.PRESERVE_DATA_GROUPING,
+            "true");
+
+    withSQLConf(sqlConf, () -> sql("UPDATE %s SET id = cast('-1' AS INT) WHERE id = 2", tableName));

Review Comment:
   Removed.






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852915


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##
@@ -42,4 +42,9 @@ private SparkSQLProperties() {}
   // Controls whether to check the order of fields during writes
   public static final String CHECK_ORDERING = "spark.sql.iceberg.check-ordering";
   public static final boolean CHECK_ORDERING_DEFAULT = true;
+
+  // Controls whether to preserve the existing grouping of data while planning splits
+  public static final String PRESERVE_DATA_GROUPING =
+      "spark.sql.iceberg.split.preserve-data-grouping";

Review Comment:
   Switched to `planning`.
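   Putting the two flags together, a hedged sketch of enabling the feature after this rename (the Spark key comes from `SQLConf.V2_BUCKETING_ENABLED`; verify the final Iceberg key against the merged code):
   ```
   import org.apache.spark.sql.SparkSession;

   class EnableStoragePartitionedJoins {
     static void enable(SparkSession spark) {
       // Spark 3.3 flag that lets V2 scans report key-grouped partitioning.
       spark.conf().set("spark.sql.sources.v2.bucketing.enabled", "true");
       // Iceberg flag asking split planning to preserve the data grouping.
       spark.conf().set("spark.sql.iceberg.planning.preserve-data-grouping", "true");
     }
   }
   ```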






[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056852950


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.source;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+import org.apache.iceberg.BaseScanTaskGroup;
+import org.apache.iceberg.PartitionField;
+import org.apache.iceberg.PartitionScanTask;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Scan;
+import org.apache.iceberg.ScanTask;
+import org.apache.iceberg.ScanTaskGroup;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.spark.Spark3Util;
+import org.apache.iceberg.spark.SparkReadConf;
+import org.apache.iceberg.types.Types.StructType;
+import org.apache.iceberg.util.StructLikeSet;
+import org.apache.iceberg.util.TableScanUtil;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.expressions.Transform;
+import org.apache.spark.sql.connector.read.SupportsReportPartitioning;
+import org.apache.spark.sql.connector.read.partitioning.KeyGroupedPartitioning;
+import org.apache.spark.sql.connector.read.partitioning.Partitioning;
+import org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+abstract class SparkPartitioningAwareScan<T extends PartitionScanTask> extends SparkScan
+    implements SupportsReportPartitioning {
+
+  private static final Logger LOG = LoggerFactory.getLogger(SparkPartitioningAwareScan.class);
+
+  private final Scan<?, ? extends ScanTask, ? extends ScanTaskGroup<?>> scan;
+  private final boolean preserveDataGrouping;
+
+  private Set<PartitionSpec> specs = null; // lazy cache of scanned specs
+  private List<T> tasks = null; // lazy cache of uncombined tasks
+  private List<ScanTaskGroup<T>> taskGroups = null; // lazy cache of task groups
+  private StructType groupingKeyType = null; // lazy cache of the grouping key type
+  private Transform[] groupingKeyTransforms = null; // lazy cache of grouping key transforms
+  private StructLikeSet groupingKeys = null; // lazy cache of grouping keys
+
+  SparkPartitioningAwareScan(
+      SparkSession spark,
+      Table table,
+      Scan<?, ? extends ScanTask, ? extends ScanTaskGroup<?>> scan,
+      SparkReadConf readConf,
+      Schema expectedSchema,
+      List<Expression> filters) {
+
+    super(spark, table, readConf, expectedSchema, filters);
+
+    this.scan = scan;
+    this.preserveDataGrouping = readConf.preserveDataGrouping();
+
+    if (scan == null) {
+      this.specs = Collections.emptySet();
+      this.tasks = Collections.emptyList();
+      this.taskGroups = Collections.emptyList();
+    }
+  }
+
+  protected abstract Class<T> taskJavaClass();
+
+  protected Scan<?, ? extends ScanTask, ? extends ScanTaskGroup<?>> scan() {
+    return scan;
+  }
+
+  @Override
+  public Partitioning outputPartitioning() {
+    if (groupingKeyType().fields().isEmpty()) {
+      LOG.info("Reporting UnknownPartitioning with {} partition(s)", taskGroups().size());
+      return new UnknownPartitioning(taskGroups().size());
+    } else {
+      LOG.info(
+          "Reporting KeyGroupedPartitioning by {} with {} partition(s)",
+          groupingKeyTransforms(),
+          taskGroups().size());
+      return new KeyGroupedPartitioning(groupingKeyTransforms(), taskGroups().size());
+    }
+  }
+
+  @Override
+  protected StructType groupingKeyType() {
+    if (groupingKeyType == null) {
+      if (preserveDataGrouping) {
+        this.groupingKeyType = computeGroupingKeyType();
+      } else {
+        this.groupingKeyType = StructType.of();
+      }
+    }
+
+    return groupingKeyType;
+  }
+
+  private StructType computeGroupingKeyType() {
+    return org.apache.iceberg.Partitioning.groupingKeyType(expectedSchema(), specs());

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056862209


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1056862319


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##

[GitHub] [iceberg] aokolnychyi merged pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi merged PR #6371:
URL: https://github.com/apache/iceberg/pull/6371





[GitHub] [iceberg] aokolnychyi commented on pull request #6371: Spark 3.3: Support storage-partitioned joins

2022-12-24 Thread GitBox


aokolnychyi commented on PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#issuecomment-1364570187

   Thanks for reviewing, @RussellSpitzer @sunchao @zinking @rdblue!
   





[GitHub] [iceberg] aokolnychyi commented on issue #430: Support bucket table for Iceberg

2022-12-24 Thread GitBox


aokolnychyi commented on issue #430:
URL: https://github.com/apache/iceberg/issues/430#issuecomment-1364570547

   I am excited to announce that support for storage-partitioned joins has been merged into master and will ship in 1.2.0. Thanks to everyone involved, especially @sunchao. I am going to resolve this issue.
   See PR #6371.
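   As a hedged usage sketch (table names are illustrative), once two tables share a compatible partition layout such as `bucket(8, id)` and the flags from #6371 are enabled, their join can be planned without a shuffle:
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   class StoragePartitionedJoinSketch {
     static Dataset<Row> join(SparkSession spark) {
       spark.conf().set("spark.sql.sources.v2.bucketing.enabled", "true");
       spark.conf().set("spark.sql.iceberg.planning.preserve-data-grouping", "true");
       // Both tables are assumed to be partitioned by bucket(8, id); the scans then
       // report KeyGroupedPartitioning and Spark can skip the exchange for this join.
       return spark.table("db.orders").join(spark.table("db.order_items"), "id");
     }
   }
   ```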





[GitHub] [iceberg] aokolnychyi closed issue #430: Support bucket table for Iceberg

2022-12-24 Thread GitBox


aokolnychyi closed issue #430: Support bucket table for Iceberg
URL: https://github.com/apache/iceberg/issues/430





[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056874376


##
python/pyiceberg/io/pyarrow.py:
##
@@ -437,3 +465,198 @@ def visit_or(self, left_result: pc.Expression, right_result: pc.Expression) -> p
 
 def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression:
     return boolean_expression_visit(expr, _ConvertToArrowExpression())
+
+
+def project_table(
+    files: Iterable[FileScanTask], table: Table, row_filter: BooleanExpression, projected_schema: Schema, case_sensitive: bool
+) -> pa.Table:
+    """Resolves the right columns based on the identifier
+
+    Args:
+        files(Iterable[FileScanTask]): A URI or a path to a local file
+        table(Table): The table that's being queried
+        row_filter(BooleanExpression): The expression for filtering rows
+        projected_schema(Schema): The output schema
+        case_sensitive(bool): Case sensitivity when looking up column names
+
+    Raises:
+        ResolveException: When an incompatible query is done
+    """
+
+    if isinstance(table.io, PyArrowFileIO):
+        scheme, path = PyArrowFileIO.parse_location(table.location())
+        fs = table.io.get_fs(scheme)
+    else:
+        raise ValueError(f"Expected PyArrowFileIO, got: {table.io}")
+
+    bound_row_filter = bind(table.schema(), row_filter, case_sensitive=case_sensitive)
+
+    projected_field_ids = {
+        id for id in projected_schema.field_ids if not isinstance(projected_schema.find_type(id), (MapType, ListType))
+    }.union(extract_field_ids(bound_row_filter))
+
+    tables = []
+    for task in files:
+        _, path = PyArrowFileIO.parse_location(task.file.file_path)
+
+        # Get the schema
+        with fs.open_input_file(path) as fout:
+            parquet_schema = pq.read_schema(fout)
+            schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
+            file_schema = Schema.parse_raw(schema_raw)
+
+        pyarrow_filter = None
+        if row_filter is not AlwaysTrue():
+            translated_row_filter = translate_column_names(bound_row_filter, file_schema, case_sensitive=case_sensitive)
+            bound_row_filter = bind(file_schema, translated_row_filter, case_sensitive=case_sensitive)
+            pyarrow_filter = expression_to_pyarrow(bound_row_filter)
+
+        file_project_schema = prune_columns(file_schema, projected_field_ids, select_full_types=False)
+
+        if file_schema is None:
+            raise ValueError(f"Missing Iceberg schema in Metadata for file: {path}")
+
+        # Prune the stuff that we don't need anyway
+        file_project_schema_arrow = schema_to_pyarrow(file_project_schema)
+
+        arrow_table = ds.dataset(
+            source=[path], schema=file_project_schema_arrow, format=ds.ParquetFileFormat(), filesystem=fs
+        ).to_table(filter=pyarrow_filter)
+
+        tables.append(to_requested_schema(projected_schema, file_project_schema, arrow_table))
+
+    if len(tables) > 1:
+        return pa.concat_tables(tables)
+    else:
+        return tables[0]
+
+
+def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table:
+    return VisitWithArrow(requested_schema, file_schema, table).visit()
+
+
+class VisitWithArrow:
+    requested_schema: Schema
+    file_schema: Schema
+    table: pa.Table
+
+    def __init__(self, requested_schema: Schema, file_schema: Schema, table: pa.Table) -> None:
+        self.requested_schema = requested_schema
+        self.file_schema = file_schema
+        self.table = table
+
+    def visit(self) -> pa.Table:
+        return self.visit_with_arrow(self.requested_schema, self.file_schema)
+
+    @singledispatchmethod
+    def visit_with_arrow(self, requested_schema: Union[Schema, IcebergType], file_schema: Union[Schema, IcebergType]) -> pa.Table:
+        """A generic function for applying a schema visitor to any point within a schema
+
+        The function traverses the schema in post-order fashion
+
+        Args:
+            obj(Schema | IcebergType): An instance of a Schema or an IcebergType
+            visitor (VisitWithArrow[T]): An instance of an implementation of the generic VisitWithArrow base class
+
+        Raises:
+            NotImplementedError: If attempting to visit an unrecognized object type
+        """
+        raise NotImplementedError(f"Cannot visit non-type: {requested_schema}")
+
+    @visit_with_arrow.register(Schema)
+    def _(self, requested_schema: Schema, file_schema: Schema) -> pa.Table:
+        """Visit a Schema with a concrete SchemaVisitorWithPartner"""
+        struct_result = self.visit_with_arrow(requested_schema.as_struct(), file_schema.as_struct())
+        pyarrow_schema = schema_to_pyarrow(requested_schema)
+        return pa.Table.from_arrays(struct_result.flatten(), schema=pyarrow_schema)
+
+    def _get_field_by_id(self, field_id: int) -> Optional[NestedField]:
+        try:
+            return sel

[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056874464


##
python/pyiceberg/io/pyarrow.py:
##

[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056874748


##
python/pyiceberg/io/pyarrow.py:
##
@@ -437,3 +465,198 @@ def visit_or(self, left_result: pc.Expression, 
right_result: pc.Expression) -> p
 
 def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression:
 return boolean_expression_visit(expr, _ConvertToArrowExpression())
+
+
+def project_table(
+files: Iterable[FileScanTask], table: Table, row_filter: 
BooleanExpression, projected_schema: Schema, case_sensitive: bool
+) -> pa.Table:
+"""Resolves the right columns based on the identifier
+
+Args:
+files(Iterable[FileScanTask]): A URI or a path to a local file
+table(Table): The table that's being queried
+row_filter(BooleanExpression): The expression for filtering rows
+projected_schema(Schema): The output schema
+case_sensitive(bool): Case sensitivity when looking up column names
+
+Raises:
+ResolveException: When an incompatible query is done
+"""
+
+if isinstance(table.io, PyArrowFileIO):
+scheme, path = PyArrowFileIO.parse_location(table.location())
+fs = table.io.get_fs(scheme)
+else:
+raise ValueError(f"Expected PyArrowFileIO, got: {table.io}")
+
+bound_row_filter = bind(table.schema(), row_filter, 
case_sensitive=case_sensitive)
+
+projected_field_ids = {
+id for id in projected_schema.field_ids if not 
isinstance(projected_schema.find_type(id), (MapType, ListType))

Review Comment:
   This is because we can't explicitly select a MapType/ListType:
   ```
   if not field.field_type.is_primitive:
   >   raise ValueError(
           f"Cannot explicitly project List or Map types, {field.field_id}:{field.name} of type {field.field_type} was selected"
       )
   E   ValueError: Cannot explicitly project List or Map types, 5:ids of type list was selected
   ```
   We're only interested in the element in the case of a list, and in the key-value in the case of a map.






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875584


##
python/pyiceberg/schema.py:
##
@@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType:
         value_type=value_result,
         value_required=map_type.value_required,
     )
+
+
+@singledispatch
+def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType:
+    """Promotes reading a file type to a read type
+
+    Args:
+        file_type (IcebergType): The type of the Avro file
+        read_type (IcebergType): The requested read type
+
+    Raises:
+        ResolveException: If attempting to resolve an unrecognized object type
+    """
+    raise ResolveException(f"Cannot promote {file_type} to {read_type}")
+
+
+@promote.register(IntegerType)
+def _(file_type: IntegerType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, LongType):

Review Comment:
   The `promote` function is only called when the read type and the file type are different.






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875610


##
python/pyiceberg/schema.py:
##
@@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType:
         value_type=value_result,
         value_required=map_type.value_required,
     )
+
+
+@singledispatch
+def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType:
+    """Promotes reading a file type to a read type
+
+    Args:
+        file_type (IcebergType): The type of the Avro file
+        read_type (IcebergType): The requested read type
+
+    Raises:
+        ResolveException: If attempting to resolve an unrecognized object type
+    """
+    raise ResolveException(f"Cannot promote {file_type} to {read_type}")

Review Comment:
   Works for me, updated 👍🏻 






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875777


##
python/pyiceberg/exceptions.py:
##
@@ -86,3 +86,7 @@ class NotInstalledError(Exception):
 
 class SignError(Exception):
     """Raises when unable to sign a S3 request"""
+
+
+class ResolveException(Exception):

Review Comment:
   Good catch, updated 👍🏻 






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875820


##
python/pyiceberg/expressions/visitors.py:
##
@@ -753,3 +756,89 @@ def inclusive_projection(
     schema: Schema, spec: PartitionSpec, case_sensitive: bool = True
 ) -> Callable[[BooleanExpression], BooleanExpression]:
     return InclusiveProjection(schema, spec, case_sensitive).project
+
+
+class _ColumnNameTranslator(BooleanExpressionVisitor[BooleanExpression]):
+    """Converts the column names with the ones in the actual file
+
+    Args:
+      file_schema (Schema): The schema of the file
+      case_sensitive (bool): Whether to consider case when binding a reference to a field in a schema, defaults to True
+
+    Raises:
+        TypeError: In the case of an UnboundPredicate
+        ValueError: When a column name cannot be found
+    """
+
+    file_schema: Schema
+    case_sensitive: bool
+
+    def __init__(self, file_schema: Schema, case_sensitive: bool) -> None:
+        self.file_schema = file_schema
+        self.case_sensitive = case_sensitive
+
+    def visit_true(self) -> BooleanExpression:
+        return AlwaysTrue()
+
+    def visit_false(self) -> BooleanExpression:
+        return AlwaysFalse()
+
+    def visit_not(self, child_result: BooleanExpression) -> BooleanExpression:
+        return Not(child=child_result)
+
+    def visit_and(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression:
+        return And(left=left_result, right=right_result)
+
+    def visit_or(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression:
+        return Or(left=left_result, right=right_result)
+
+    def visit_unbound_predicate(self, predicate: UnboundPredicate[L]) -> BooleanExpression:
+        raise TypeError(f"Expected Bound Predicate, got: {predicate.term}")
+
+    def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> BooleanExpression:
+        file_column_name = self.file_schema.find_column_name(predicate.term.ref().field.field_id)
+
+        if not file_column_name:
+            raise ValueError(f"Not found in schema: {file_column_name}")

Review Comment:
   Great suggestion






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056875859


##
python/pyiceberg/expressions/visitors.py:
##
@@ -753,3 +756,89 @@ def inclusive_projection(
     schema: Schema, spec: PartitionSpec, case_sensitive: bool = True
 ) -> Callable[[BooleanExpression], BooleanExpression]:
     return InclusiveProjection(schema, spec, case_sensitive).project
+
+
+class _ColumnNameTranslator(BooleanExpressionVisitor[BooleanExpression]):
+    """Converts the column names with the ones in the actual file
+
+    Args:
+      file_schema (Schema): The schema of the file
+      case_sensitive (bool): Whether to consider case when binding a reference to a field in a schema, defaults to True
+
+    Raises:
+        TypeError: In the case of an UnboundPredicate
+        ValueError: When a column name cannot be found
+    """
+
+    file_schema: Schema
+    case_sensitive: bool
+
+    def __init__(self, file_schema: Schema, case_sensitive: bool) -> None:
+        self.file_schema = file_schema
+        self.case_sensitive = case_sensitive
+
+    def visit_true(self) -> BooleanExpression:
+        return AlwaysTrue()
+
+    def visit_false(self) -> BooleanExpression:
+        return AlwaysFalse()
+
+    def visit_not(self, child_result: BooleanExpression) -> BooleanExpression:
+        return Not(child=child_result)
+
+    def visit_and(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression:
+        return And(left=left_result, right=right_result)
+
+    def visit_or(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression:
+        return Or(left=left_result, right=right_result)
+
+    def visit_unbound_predicate(self, predicate: UnboundPredicate[L]) -> BooleanExpression:
+        raise TypeError(f"Expected Bound Predicate, got: {predicate.term}")
+
+    def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> BooleanExpression:
+        file_column_name = self.file_schema.find_column_name(predicate.term.ref().field.field_id)
+
+        if not file_column_name:
+            raise ValueError(f"Not found in schema: {file_column_name}")
+
+        if isinstance(predicate, BoundUnaryPredicate):
+            return predicate.as_unbound(file_column_name)
+        elif isinstance(predicate, BoundLiteralPredicate):
+            return predicate.as_unbound(file_column_name, predicate.literal)
+        elif isinstance(predicate, BoundSetPredicate):
+            return predicate.as_unbound(file_column_name, predicate.literals)
+        else:
+            raise ValueError(f"Unknown predicate: {predicate}")

Review Comment:
   Good call






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056876255


##
python/pyiceberg/types.py:
##
@@ -268,6 +268,10 @@ def __init__(self, *fields: NestedField, **data: Any):
         data["fields"] = fields
         super().__init__(**data)
 
+    def by_id(self) -> Dict[int, NestedField]:

Review Comment:
   Good call, updated!






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056876528


##
python/tests/avro/test_resolver.py:
##
@@ -164,17 +163,17 @@ def test_resolver_change_type() -> None:
 
 
 def test_promote_int_to_long() -> None:
-    assert promote(IntegerType(), LongType()) == IntegerReader()
+    assert promote(IntegerType(), LongType()) == LongType()

Review Comment:
   For those tests (and since it is in Avro), we want to use `resolve`. The difference is that with `resolve` we check whether the promotion is valid; but when going from float to double, we still want to read the value as a float (since a double is twice as many bytes), so we'll return a float.






[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056877937


##
python/pyiceberg/schema.py:
##
@@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType:
         value_type=value_result,
         value_required=map_type.value_required,
     )
+
+
+@singledispatch
+def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType:
+    """Promotes reading a file type to a read type
+
+    Args:
+        file_type (IcebergType): The type of the Avro file
+        read_type (IcebergType): The requested read type
+
+    Raises:
+        ResolveException: If attempting to resolve an unrecognized object type
+    """
+    raise ResolveException(f"Cannot promote {file_type} to {read_type}")
+
+
+@promote.register(IntegerType)
+def _(file_type: IntegerType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, LongType):
+        # Ints/Longs are binary compatible in Avro, so this is okay
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote an int to {read_type}")
+
+
+@promote.register(FloatType)
+def _(file_type: FloatType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, DoubleType):
+        # A double type is wider
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a float to {read_type}")
+
+
+@promote.register(StringType)
+def _(file_type: StringType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, BinaryType):
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a string to {read_type}")
+
+
+@promote.register(BinaryType)
+def _(file_type: BinaryType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, StringType):
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a binary to {read_type}")
+
+
+@promote.register(DecimalType)
+def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, DecimalType):
+        if file_type.precision <= read_type.precision and file_type.scale == read_type.scale:
+            return read_type
+        else:
+            raise ResolveException(f"Cannot reduce precision from {file_type} to {read_type}")
+    else:
+        raise ResolveException(f"Cannot promote a decimal to {read_type}")
+
+
+@promote.register(StructType)
+def _(file_type: StructType, read_type: IcebergType) -> IcebergType:
Review Comment:
   Yes, we need this because we call it from the `field` method in the visitor, and a field can also have a StructType. We don't have enough information to handle this in the primitive method.
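
   To make the dispatch concrete, a small usage sketch of the `promote` function from the diff above (the import paths are my assumption):

   ```python
   from pyiceberg.schema import promote
   from pyiceberg.types import FloatType, IntegerType, LongType, StringType

   assert promote(IntegerType(), LongType()) == LongType()  # widening is allowed

   try:
       promote(FloatType(), StringType())  # no registered promotion, so this raises
   except Exception as err:  # ResolveException in the diff above
       print(err)  # e.g. "Cannot promote a float to string"
   ```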



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] Fokko commented on a diff in pull request #6437: Python: Projection by Field ID

2022-12-24 Thread GitBox


Fokko commented on code in PR #6437:
URL: https://github.com/apache/iceberg/pull/6437#discussion_r1056878066


##
python/pyiceberg/schema.py:
##
@@ -1046,3 +1055,79 @@ def _project_map(map_type: MapType, value_result: IcebergType) -> MapType:
         value_type=value_result,
         value_required=map_type.value_required,
     )
+
+
+@singledispatch
+def promote(file_type: IcebergType, read_type: IcebergType) -> IcebergType:
+    """Promotes reading a file type to a read type
+
+    Args:
+        file_type (IcebergType): The type of the Avro file
+        read_type (IcebergType): The requested read type
+
+    Raises:
+        ResolveException: If attempting to resolve an unrecognized object type
+    """
+    raise ResolveException(f"Cannot promote {file_type} to {read_type}")
+
+
+@promote.register(IntegerType)
+def _(file_type: IntegerType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, LongType):
+        # Ints/Longs are binary compatible in Avro, so this is okay
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote an int to {read_type}")
+
+
+@promote.register(FloatType)
+def _(file_type: FloatType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, DoubleType):
+        # A double type is wider
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a float to {read_type}")
+
+
+@promote.register(StringType)
+def _(file_type: StringType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, BinaryType):
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a string to {read_type}")
+
+
+@promote.register(BinaryType)
+def _(file_type: BinaryType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, StringType):
+        return read_type
+    else:
+        raise ResolveException(f"Cannot promote a binary to {read_type}")
+
+
+@promote.register(DecimalType)
+def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, DecimalType):
+        if file_type.precision <= read_type.precision and file_type.scale == read_type.scale:
+            return read_type
+        else:
+            raise ResolveException(f"Cannot reduce precision from {file_type} to {read_type}")
+    else:
+        raise ResolveException(f"Cannot promote a decimal to {read_type}")
+
+
+@promote.register(StructType)
+def _(file_type: StructType, read_type: IcebergType) -> IcebergType:

Review Comment:
   I would say we can rewrite:
   ```python
   def struct(
       self, struct: StructType, struct_array: Optional[pa.Array], field_results: List[Optional[pa.Array]]
   ) -> Optional[pa.Array]:
       if struct_array is None:
           return None

       field_arrays: List[pa.Array] = []
       fields: List[pa.Field] = []
       for pos, field_array in enumerate(field_results):
           field = struct.fields[pos]
           if field_array is not None:
               array = self.cast_if_needed(field, field_array)
               field_arrays.append(array)
               fields.append(pa.field(field.name, array.type, field.optional))
           elif field.optional:
               arrow_type = schema_to_pyarrow(field.field_type)
               field_arrays.append(pa.nulls(len(struct_array)).cast(arrow_type))
               fields.append(pa.field(field.name, arrow_type, field.optional))
           else:
               raise ResolveException(f"Field is required, and could not be found in the file: {field}")

       return pa.StructArray.from_arrays(arrays=field_arrays, fields=pa.struct(fields))

   def field(self, field: NestedField, _: Optional[pa.Array], field_array: Optional[pa.Array]) -> Optional[pa.Array]:
       return field_array
   ```
   I think this is cleaner because we just handle the field on the field method:

   ```python
   def struct(
       self, struct: StructType, struct_array: Optional[pa.Array], field_results: List[Optional[pa.Array]]
   ) -> Optional[pa.Array]:
       if struct_array is None:
           return None
       return pa.StructArray.from_arrays(arrays=field_results, fields=pa.struct(schema_to_pyarrow(struct)))

   def field(self, field: NestedField, _: Optional[pa.Array], field_array: Optional[pa.Array]) -> Optional[pa.Array]:
       if field_array is not None:
           return self.cast_if_needed(field, field_array)
       elif field.optional:
           arrow_type = schema_to_pyarrow(field.field_type)
           return pa.nulls(3, type=arrow_type)  # We need to find the length somehow
       else:
           raise ResolveException(f"Field is required, and could not be found in the file: {field}")
   ```

   But then the field_type can still be a StructType



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] commented on issue #5139: Historical time travel imports

2022-12-24 Thread GitBox


github-actions[bot] commented on issue #5139:
URL: https://github.com/apache/iceberg/issues/5139#issuecomment-1364598871

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] github-actions[bot] commented on issue #5141: No way to rollback first commit in table

2022-12-24 Thread GitBox


github-actions[bot] commented on issue #5141:
URL: https://github.com/apache/iceberg/issues/5141#issuecomment-1364598864

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #6449: WIP: Delta, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table

2022-12-24 Thread GitBox


JonasJ-ap commented on code in PR #6449:
URL: https://github.com/apache/iceberg/pull/6449#discussion_r1056753664


##
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##
@@ -161,7 +161,7 @@ private static Metrics getAvroMetrics(Path path, Configuration conf) {
 }
   }
 
-  private static Metrics getParquetMetrics(
+  public static Metrics getParquetMetrics(

Review Comment:
   Thank you for your suggestion. It seems doing so would mean copy-pasting `getParquetMetrics` from `TableMigrationUtil` into `BaseMigrateDeltaLakeTableAction`. I think the trade-off here is between a duplicated code section and exposing private methods. Given that `TableMigrationUtil` is intended to provide util methods for table migration, do you think it may be proper to make these methods public in this case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6491: Build: Bump actions/stale from 6.0.1 to 7.0.0

2022-12-24 Thread GitBox


dependabot[bot] opened a new pull request, #6491:
URL: https://github.com/apache/iceberg/pull/6491

   Bumps [actions/stale](https://github.com/actions/stale) from 6.0.1 to 7.0.0.
   
   Release notes
   Sourced from [actions/stale's releases](https://github.com/actions/stale/releases).

   v7.0.0
   ⚠️ This version contains breaking changes ⚠️

   What's Changed
   - Allow daysBeforeStale options to be float by @irega in actions/stale#841
   - Use cache in check-dist.yml by @jongwooo in actions/stale#876
   - fix print outputs step in existing workflows by @irega in actions/stale#859
   - Update issue and PR templates, add/delete workflow files by @IvanZosimov in actions/stale#880
   - Update how stale handles exempt items by @johnsudol in actions/stale#874

   Breaking Changes
   - In this release we prevent this action from managing the stale label on items included in exempt-issue-labels and exempt-pr-labels
   - We decided that this is outside of the scope of this action, and to be left up to the maintainer

   New Contributors
   - @irega made their first contribution in actions/stale#841
   - @jongwooo made their first contribution in actions/stale#876
   - @IvanZosimov made their first contribution in actions/stale#880
   - @johnsudol made their first contribution in actions/stale#874

   Full Changelog: https://github.com/actions/stale/compare/v6...v7.0.0

   Changelog
   Sourced from [actions/stale's changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md).

   [7.0.0]
   :warning: Breaking change :warning:
   - Allow daysBeforeStale options to be float by @irega in actions/stale#841
   - Use cache in check-dist.yml by @jongwooo in actions/stale#876
   - fix print outputs step in existing workflows by @irega in actions/stale#859
   - Update issue and PR templates, add/delete workflow files by @IvanZosimov in actions/stale#880
   - Update how stale handles exempt items by @johnsudol in actions/stale#874

   Commits
   - 6f05e42 draft release for v7.0.0 (#888)
   - eed91cb Update how stale handles exempt items (#874)
   - 10dc265 Merge pull request #880 from akv-platform/update-stale-repo
   - 9c1eb3f Update .md files and allign build-test.yml with the current test.yml
   - bc357bd Update .github/workflows/release-new-action-version.yml
   - 690ede5 Update .github/ISSUE_TEMPLATE/bug_report.md
   - afbcabf Merge branch 'main' into update-stale-repo
   - e364411 Update name of codeql.yml file
   - 627cef3 fix print outputs step (#859)
   - 975308f ... (truncated)

[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6492: Build: Bump adlfs from 2022.10.0 to 2022.11.2 in /python

2022-12-24 Thread GitBox


dependabot[bot] opened a new pull request, #6492:
URL: https://github.com/apache/iceberg/pull/6492

   Bumps [adlfs](https://github.com/dask/adlfs) from 2022.10.0 to 2022.11.2.
   
   Changelog
   Sourced from [adlfs's changelog](https://github.com/fsspec/adlfs/blob/main/CHANGELOG.md).

   2022.11.2
   - Reorder fs.info() to search the parent directory only after searching for the specified item directly
   - Removed pin on upper bound for azure-storage-blob
   - Moved AzureDatalakeFileSystem to a separate module, in acknowledgement of the [Microsoft End of Life notice](https://learn.microsoft.com/en-us/answers/questions/281107/azure-data-lake-storage-gen1-retirement-announceme.html)
   - Added DeprecationWarning to AzureDatalakeFilesystem

   2022.10.1
   - Pin azure-storage-blob >=12.12.0,=1.23.1,<2.0.0

   2022.9.1
   - Fixed missing dependency on aiohttp in package metadata.

   2022.9.0
   - Add support to AzureBlobFileSystem for versioning
   - Assure full uri's are left stripped to remove ""
   - Set Python requires >=3.8 in setup.cfg
   - _strip_protocol handle lists

   2022.7.0
   - Fix overflow error when uploading files > 2GB

   2022.04.0
   - Added support for Python 3.10 and pinned Python 3.8
   - Skip test_url due to bug in Azurite
   - Added isort and update pre-commit

   v2022.02.0
   - Updated requirements to fsspec >= 2021.10.1 to fix dask/adlfs#280
   - Fixed deprecation warning in pytest_asyncio by setting asycio_mode = True

   v2021.10.1
   - Added support for Hierarchical Namespaces in Gen2 to enable multilevel Hive partitioned tables in pyarrow
   - Registered abfss:// as an entrypoint instead of registering at runtime
   - Implemented support for fsspec callbacks in put_file and get_file

   v2021.09.1
   - Fixed isdir() bug causing some directories to be labeled incorrectly as files
   - Added flexible url handling to improve compatibility with other applications using Spark and fsspec

   v2021.08.1

   ... (truncated)
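
   For context, a minimal sketch of the `fs.info()` call whose lookup order 2022.11.2 reorders (the account, container, and path here are placeholders, and the constructor arguments are my assumption):

   ```python
   from adlfs import AzureBlobFileSystem

   # Placeholder connection details; real use needs an account and credential
   fs = AzureBlobFileSystem(account_name="myaccount", anon=True)

   # 2022.11.2 checks the item itself before falling back to listing its parent
   info = fs.info("mycontainer/path/to/file.parquet")
   print(info["type"], info["size"])
   ```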
   
   
   Commits
   - b01b30d Added deprecationwarning to azuredatalakefilesystem and removed it from extra...
   - 0d47f38 Patch1 (#372)
   - 6b4f3d8 _ls: support detail (#369)
   - 5cd75d0 Improve credential how-to description in README (#367)
   - f9ac1a8 Fix info perf (#370)
   - f15c37a Updated instructions for CONTRIBUTING.md to use colima (#355)
   - 6aa23f7 Fix info() performance (#360)
   - 6a3530f Respect onerror='return' in AzureBlobFileSystem.cat (#359)
   - 065b71b Update changelog.md (#356)
   - See full diff in [compare view](https://github.com/dask/adlfs/compare/2022.10.0...2022.11.2)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=adlfs&package-manager=pip&previous-version=2022.10.0&new-version=2022.11.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` w

[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6493: Build: Bump coverage from 6.5.0 to 7.0.1 in /python

2022-12-24 Thread GitBox


dependabot[bot] opened a new pull request, #6493:
URL: https://github.com/apache/iceberg/pull/6493

   Bumps [coverage](https://github.com/nedbat/coveragepy) from 6.5.0 to 7.0.1.
   
   Changelog
   Sourced from [coverage's changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst).

   Version 7.0.1 — 2022-12-23

   - When checking if a file mapping resolved to a file that exists, we weren't
     considering files in .whl files. This is now fixed, closing issue 1511.
   - File pattern rules were too strict, forbidding plus signs and curly braces in
     directory and file names. This is now fixed, closing issue 1513.
   - Unusual Unicode or control characters in source files could prevent
     reporting. This is now fixed, closing issue 1512.
   - The PyPy wheel now installs on PyPy 3.7, 3.8, and 3.9, closing issue 1510.

   issue 1510: nedbat/coveragepy#1510
   issue 1511: nedbat/coveragepy#1511
   issue 1512: nedbat/coveragepy#1512
   issue 1513: nedbat/coveragepy#1513

   Version 7.0.0 — 2022-12-18

   Nothing new beyond 7.0.0b1.

   Version 7.0.0b1 — 2022-12-03

   A number of changes have been made to file path handling, including pattern
   matching and path remapping with the [paths] setting (see config_paths).
   These changes might affect you, and require you to update your settings.

   (This release includes the changes from 6.6.0b1, since 6.6.0 was never released.)

   - Changes to file pattern matching, which might require updating your
     configuration:
     - Previously, * would incorrectly match directory separators, making
       precise matching difficult. This is now fixed, closing issue 1407.
     - Now ** matches any number of nested directories, including none.
   - Improvements to combining data files when using the
   ... (truncated)
   
   
   Commits
   - c5cda3a docs: releases take a little bit longer now
   - 9d4226e docs: latest sample HTML report
   - 8c77758 docs: prep for 7.0.1
   - da1b282 fix: also look into .whl files for source
   - d327a70 fix: more information when mapping rules aren't working right.
   - 35e249f fix: certain strange characters caused reporting to fail. #1512
   - 152cdc7 fix: don't forbid plus signs in file names. #1513
   - 31513b4 chore: make upgrade
   - 873b059 test: don't run tests on Windows PyPy-3.9
   - 5c5caa2 build: PyPy wheel now installs on 3.7, 3.8, and 3.9. #1510
   - Additional commits viewable in [compare view](https://github.com/nedbat/coveragepy/compare/6.5.0...7.0.1)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=coverage&package-manager=pip&previous-version=6.5.0&new-version=7.0.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot sq

[GitHub] [iceberg] dependabot[bot] opened a new pull request, #6494: Build: Bump moto from 4.0.11 to 4.0.12 in /python

2022-12-24 Thread GitBox


dependabot[bot] opened a new pull request, #6494:
URL: https://github.com/apache/iceberg/pull/6494

   Bumps [moto](https://github.com/spulec/moto) from 4.0.11 to 4.0.12.
   
   Changelog
   Sourced from [moto's changelog](https://github.com/spulec/moto/blob/master/CHANGELOG.md).
   
   4.0.12
   Docker Digest for 4.0.12: 
sha256:06916d3f310c68fd445468f06d6d4ae6f855e7f2b80e007a90bd11eeb421b5ed
   General:
   * Fixes our Kinesis-compatibility with botocore>=1.29.31 - earlier 
Moto-versions will connect to AWS when using this botocore-version 
   New Methods:
   * Athena:
     * get_query_results()
     * list_query_executions()
   * RDS:
     * promote_read_replica()
   * Sagemaker:
     * create_pipeline()
     * delete_pipeline()
     * list_pipelines()
   Miscellaneous:
   * AWSLambda: publish_function() and update_function_code() now only 
increment the version if the source code has changed
   * CognitoIDP: Passwords are now validated using the PasswordPolicy (either 
supplied, or the default)
   * CloudFormation: create_stack() now propagates parameters StackPolicyBody 
and TimeoutInMinutes
   * CloudFormation: create_stack_instances() now returns the actual OperationId
   * CloudFormation: create_stack_set() now validates the provided name
   * CloudFormation: create_stack_set() now supports the 
DeploymentTargets-parameter
   * CloudFormation: create_stack_set() now actually creates the provided 
resources
   * CloudFormation: create_stack_set() now propagates parameters 
AdministrationRoleARN and ExecutionRoleName
   * CloudFormation: describe_stack_set() now returns the attributes 
Description, PermissionModel
   * CloudFormation: delete_stack_set() now validates that no instances are 
present before deleting the set
   * CloudWatch: get_metric_data() now supports the Label-parameter
   * EC2: allocate_address() now has improved behaviour for the Domain-parameter
   * EC2: create_volume() now supports the Iops-parameter
   * ECR: Improved ImageManifest support
   * KMS: describe_key() now returns an AccessDeniedException if the supplied 
policy does not allow this action
   * Route53: change_resource_record_sets() has additional validations
   * Route53: create_hosted_zone() now also creates a SOA-record by default
   * S3: put_object() now returns the ChecksumAlgorithm-attribute if supplied
   * SSM: describe_parameters() now has improved support for filtering by tags
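
   For anyone updating tests against the newly mocked Athena calls, a minimal sketch (the region, bucket, and query are placeholders):

   ```python
   import boto3
   from moto import mock_athena


   @mock_athena
   def test_athena_query_results():
       client = boto3.client("athena", region_name="us-east-1")
       exec_id = client.start_query_execution(
           QueryString="SELECT 1",
           ResultConfiguration={"OutputLocation": "s3://placeholder-bucket/results/"},
       )["QueryExecutionId"]
       # list_query_executions() and get_query_results() are new in 4.0.12
       assert exec_id in client.list_query_executions()["QueryExecutionIds"]
       client.get_query_results(QueryExecutionId=exec_id)


   test_athena_query_results()
   ```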
   
   
   
   
   Commits
   - 626803a Prepare release 4.0.12 (#5781)
   - 42d8216 Techdebt: KMS: Mock RSA calls in some tests (#5782)
   - f67abbe Add sagemaker mock call: delete_pipeline (#5780)
   - 137f06b KMS: Basic key policy enforcement (#5777)
   - ee6e8dd Fix some Batch error message typos (#5779)
   - 52891e1 EC2: Add iops to volume (#5776)
   - e5d40f6 SageMaker: create_pipeline, list_pipelines (#5771)
   - 2cf770f ECR Manifest List Support (#5753)
   - 16f9ff5 Kinesis: Support new endpoint for botocore 1.29.31 (#5778)
   - 07a8d6f Add Athena: get_query_results, list_query_executions (#5648)
   - Additional commits viewable in [compare view](https://github.com/spulec/moto/compare/4.0.11...4.0.12)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=moto&package-manager=pip&previous-version=4.0.11&new-version=4.0.12)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you

[GitHub] [iceberg] nickvazz commented on issue #6061: [Python] Add examples

2022-12-24 Thread GitBox


nickvazz commented on issue #6061:
URL: https://github.com/apache/iceberg/issues/6061#issuecomment-1364628562

   Would love to see a minimal getting started / setup 
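
   Something as small as the following would already go a long way: a sketch assuming a configured REST catalog, with the URI and table name as placeholders.

   ```python
   from pyiceberg.catalog import load_catalog

   # Placeholder connection details for a REST catalog
   catalog = load_catalog("default", uri="http://localhost:8181")

   table = catalog.load_table("examples.nyc_taxis")
   print(table.schema())
   print(table.current_snapshot())
   ```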


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org