[GitHub] [iceberg] Fokko opened a new pull request, #6580: Python: Bump pylint

2023-01-13 Thread GitBox


Fokko opened a new pull request, #6580:
URL: https://github.com/apache/iceberg/pull/6580

   And remove the temporary hack





[GitHub] [iceberg] Fokko commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-13 Thread GitBox


Fokko commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1069592946


##
python/pyiceberg/avro/reader.py:
##
@@ -238,41 +249,50 @@ def skip(self, decoder: BinaryDecoder) -> None:
 return self.option.skip(decoder)
 
 
-class StructProtocolReader(Reader):
-create_struct: Callable[[], StructProtocol]
-fields: Tuple[Tuple[Optional[int], Reader], ...]
+class StructReader(Reader):
+field_readers: Tuple[Tuple[Optional[int], Reader], ...]
+create_struct: Type[StructProtocol]
+struct: Optional[StructType]
 
-def __init__(self, fields: Tuple[Tuple[Optional[int], Reader], ...], create_struct: Callable[[], StructProtocol]):
-self.create_struct = create_struct
-self.fields = fields
-
-def create_or_reuse(self, reuse: Optional[StructProtocol]) -> StructProtocol:
-if reuse:
-return reuse
-else:
-return self.create_struct()
+def __init__(
+self,
+field_readers: Tuple[Tuple[Optional[int], Reader], ...],
+create_struct: Optional[Type[StructProtocol]] = None,

Review Comment:
   Inspecting gives different results amongst different Python versions 😭 It works fine locally on 3.10, but CI fails on 3.8.
   
   It looks like it doesn't look up the constructor of the class that it inherits from:
   ```
   FullArgSpec(args=['cls'], varargs='args', varkw='kwds', defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={})
   ```
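   
   For context, a minimal repro sketch of the inspection involved (hypothetical stand-in classes, not the actual pyiceberg ones): on 3.10 `inspect.getfullargspec` resolves the inherited `__init__`, while on 3.8 a `Protocol` subclass can fall back to the generic `(cls, *args, **kwds)` spec shown above.
   
   ```python
   import inspect
   from typing import Protocol
   
   class StructProtocol(Protocol):  # stand-in for pyiceberg's StructProtocol
       def __getitem__(self, pos: int) -> object: ...
   
   class Record(StructProtocol):
       def __init__(self, *data: object) -> None:
           self._data = list(data)
   
       def __getitem__(self, pos: int) -> object:
           return self._data[pos]
   
   class ManifestEntry(Record):
       """Inherits Record.__init__ without defining its own."""
   
   # Expected: the spec of Record.__init__ (self, *data); on Python 3.8 this
   # may instead report the generic (cls, *args, **kwds) fallback.
   print(inspect.getfullargspec(ManifestEntry))
   ```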






[GitHub] [iceberg] Fokko commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-13 Thread GitBox


Fokko commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1069602456


##
python/pyiceberg/avro/reader.py:
##
@@ -238,41 +249,50 @@ def skip(self, decoder: BinaryDecoder) -> None:
 return self.option.skip(decoder)
 
 
-class StructProtocolReader(Reader):
-create_struct: Callable[[], StructProtocol]
-fields: Tuple[Tuple[Optional[int], Reader], ...]
+class StructReader(Reader):
+field_readers: Tuple[Tuple[Optional[int], Reader], ...]
+create_struct: Type[StructProtocol]
+struct: Optional[StructType]
 
-def __init__(self, fields: Tuple[Tuple[Optional[int], Reader], ...], create_struct: Callable[[], StructProtocol]):
-self.create_struct = create_struct
-self.fields = fields
-
-def create_or_reuse(self, reuse: Optional[StructProtocol]) -> StructProtocol:
-if reuse:
-return reuse
-else:
-return self.create_struct()
+def __init__(
+self,
+field_readers: Tuple[Tuple[Optional[int], Reader], ...],
+create_struct: Optional[Type[StructProtocol]] = None,

Review Comment:
   Moving to an _it’s easier to ask for forgiveness than permission_ approach
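   
   For reference, a minimal sketch of the EAFP pattern being adopted (hypothetical names, not the final PR code): attempt the richer constructor call and recover from `TypeError`, instead of inspecting signatures up front.
   
   ```python
   from typing import Any, Type
   
   def new_instance(create_struct: Type[Any], num_fields: int) -> Any:
       # EAFP: try the call and handle the failure, rather than the
       # LBYL-style inspect.getfullargspec check that varies across versions.
       try:
           return create_struct(num_fields)  # constructor that accepts a size
       except TypeError:
           return create_struct()  # plain no-argument constructor
   ```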






[GitHub] [iceberg-docs] InvisibleProgrammer commented on pull request #191: Fix sidebar

2023-01-13 Thread GitBox


InvisibleProgrammer commented on PR #191:
URL: https://github.com/apache/iceberg-docs/pull/191#issuecomment-1382031000

   @samredai: thanks for the approval.





[GitHub] [iceberg-docs] RussellSpitzer merged pull request #191: Fix sidebar

2023-01-13 Thread GitBox


RussellSpitzer merged PR #191:
URL: https://github.com/apache/iceberg-docs/pull/191





[GitHub] [iceberg-docs] RussellSpitzer commented on pull request #191: Fix sidebar

2023-01-13 Thread GitBox


RussellSpitzer commented on PR #191:
URL: https://github.com/apache/iceberg-docs/pull/191#issuecomment-1382043820

   Thanks for the review, @samredai, and thank you @InvisibleProgrammer for the PR.





[GitHub] [iceberg] szehon-ho opened a new pull request, #6581: Spark 3.3: Add RemoveDanglingDeletes action

2023-01-13 Thread GitBox


szehon-ho opened a new pull request, #6581:
URL: https://github.com/apache/iceberg/pull/6581

   This adds an action to clean up dangling (invalid) DeleteFiles that may otherwise keep getting carried over with the table's current snapshot.
   
   The problem statement and design doc are here: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit#
   
   In a nutshell, the current table-wide mechanism is crude and may miss many opportunities to age off invalid DeleteFiles, even after compaction, which negatively impacts performance.
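   
   For a sense of the intended usage, a sketch following the usual SparkActions pattern (the entry-point name mirrors this PR's title and is an assumption, not a released API):
   
   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.spark.actions.SparkActions;
   
   class RemoveDanglingDeletesExample {
     // Hypothetical usage: remove delete files that no longer apply to any
     // live data file, then commit the rewritten metadata.
     static void removeDanglingDeletes(Table table) {
       SparkActions.get()
           .removeDanglingDeletes(table) // hypothetical entry point from this PR
           .execute();
     }
   }
   ```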





[GitHub] [iceberg] krvikash commented on a diff in pull request #6499: AWS, Core, Hive: Fix `checkCommitStatus` when create table commit fails

2023-01-13 Thread GitBox


krvikash commented on code in PR #6499:
URL: https://github.com/apache/iceberg/pull/6499#discussion_r1069695013


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCommits.java:
##
@@ -307,6 +312,61 @@ public void testAlreadyExistsException() {
 () -> catalog.createTable(TABLE_IDENTIFIER, schema, PartitionSpec.unpartitioned()));
   }
 
+  @Test
+  public void testCreateTableCommitFailure() throws TException, InterruptedException {

Review Comment:
   Thanks, @nastra for the review.
   
   Yes, you are right `testCreateTableCommitFailure` will not reach 
`checkCommitStatus `. I have not seen any test case which covers create table 
failure which is why I have added this test case to verify the create table 
failure where we are not required to check the commit status. 
   
   Now I have separated this change into a new commit.






[GitHub] [iceberg] stevenzwu merged pull request #6572: Flink: backport PR #6337, PR #6426, PR #6557 to Flink 1.14 and 1.15 f…

2023-01-13 Thread GitBox


stevenzwu merged PR #6572:
URL: https://github.com/apache/iceberg/pull/6572





[GitHub] [iceberg] stevenzwu commented on pull request #6572: Flink: backport PR #6337, PR #6426, PR #6557 to Flink 1.14 and 1.15 f…

2023-01-13 Thread GitBox


stevenzwu commented on PR #6572:
URL: https://github.com/apache/iceberg/pull/6572#issuecomment-1382135319

   thanks @pvary for the review





[GitHub] [iceberg] stevenzwu merged pull request #6401: Flink: Change to oldestAncestorAfter

2023-01-13 Thread GitBox


stevenzwu merged PR #6401:
URL: https://github.com/apache/iceberg/pull/6401





[GitHub] [iceberg] stevenzwu commented on pull request #6401: Flink: Change to oldestAncestorAfter

2023-01-13 Thread GitBox


stevenzwu commented on PR #6401:
URL: https://github.com/apache/iceberg/pull/6401#issuecomment-1382149280

   thanks @hililiwei for the contribution





[GitHub] [iceberg] stevenzwu merged pull request #6222: Flink: Support inspecting table

2023-01-13 Thread GitBox


stevenzwu merged PR #6222:
URL: https://github.com/apache/iceberg/pull/6222





[GitHub] [iceberg] stevenzwu commented on pull request #6222: Flink: Support inspecting table

2023-01-13 Thread GitBox


stevenzwu commented on PR #6222:
URL: https://github.com/apache/iceberg/pull/6222#issuecomment-1382153049

   thanks @hililiwei for contributing this major feature





[GitHub] [iceberg] rdblue commented on a diff in pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs

2023-01-13 Thread GitBox


rdblue commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1069784838


##
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##
@@ -560,24 +560,26 @@ private <T> T max(Statistics<T> statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row group but no further
-   * statistics are available.
-   *
-   * We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics may have a null count but undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are present.
 *
 * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different
 * unusual hasNonNull behavior. OSS Parquet builds are not affected because PARQUET-251 prohibits
 * the reading of these statistics from versions of Parquet earlier than 1.8.0.
 *
 * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
 */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
-return statistics.getNumNulls() < valueCount
-&& (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
+  static boolean nullMinMax(Statistics statistics) {
+return statistics.getMaxBytes() == null || statistics.getMinBytes() == null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);

Review Comment:
   Sounds good. It's odd that this is not based on the number of values and number of nulls. Okay though, I guess it's just a strange API.
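   
   A count-based variant of the check described here would look something like this sketch (hypothetical name; the condition is the one the old `hasNonNullButNoMinMax` used):
   
   ```java
   import org.apache.parquet.column.statistics.Statistics;
   
   class StatsCheckSketch {
     // Non-null values exist when nulls do not account for every value in
     // the row group; no reliance on Statistics.hasNonNullValue().
     static boolean hasNonNullByCounts(Statistics<?> statistics, long valueCount) {
       return statistics.getNumNulls() < valueCount;
     }
   }
   ```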






[GitHub] [iceberg] rdblue merged pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs

2023-01-13 Thread GitBox


rdblue merged PR #6517:
URL: https://github.com/apache/iceberg/pull/6517





[GitHub] [iceberg] rdblue closed issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows

2023-01-13 Thread GitBox


rdblue closed issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows
URL: https://github.com/apache/iceberg/issues/6516





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs

2023-01-13 Thread GitBox


RussellSpitzer commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1069785830


##
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##
@@ -560,24 +560,26 @@ private <T> T max(Statistics<T> statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row group but no further
-   * statistics are available.
-   *
-   * We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics may have a null count but undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are present.
 *
 * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different
 * unusual hasNonNull behavior. OSS Parquet builds are not affected because PARQUET-251 prohibits
 * the reading of these statistics from versions of Parquet earlier than 1.8.0.
 *
 * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
 */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
-return statistics.getNumNulls() < valueCount
-&& (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
+  static boolean nullMinMax(Statistics statistics) {
+return statistics.getMaxBytes() == null || statistics.getMinBytes() == null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);

Review Comment:
   It is quite strange






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name

2023-01-13 Thread GitBox


jackye1995 commented on code in PR #6575:
URL: https://github.com/apache/iceberg/pull/6575#discussion_r1069789840


##
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##
@@ -231,6 +233,68 @@ public void testVersionAsOf() {
 assertEquals("Snapshot at specific ID " + snapshotId, expected, fromDF);
   }
 
+  @Test
+  public void testTagReferenceAsOf() {
+Table table = validationCatalog.loadTable(tableIdent);
+long snapshotId = table.currentSnapshot().snapshotId();
+table.manageSnapshots().createTag("test_tag", snapshotId).commit();
+
+// create a second snapshot, read the table at the snapshot
+List<Object[]> expected = sql("SELECT * FROM %s", tableName);
+sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName);
+List<Object[]> actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_tag'", tableName);
+assertEquals("Snapshot at specific tag reference name", expected, actual1);
+
+// read the table at the snapshot
+// HIVE time travel syntax
+List<Object[]> actual2 = sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_tag'", tableName);
+assertEquals("Snapshot at specific tag reference name", expected, actual2);
+
+// read the table using DataFrameReader option: tag
+Dataset<Row> df =
+spark.read().format("iceberg").option(SparkReadOptions.TAG, "test_tag").load(tableName);
+List<Object[]> fromDF = rowsToJava(df.collectAsList());
+assertEquals("Snapshot at specific tag reference name", expected, fromDF);
+  }
+
+  @Test
+  public void testBranchReferenceAsOf() {
+Table table = validationCatalog.loadTable(tableIdent);
+long snapshotId = table.currentSnapshot().snapshotId();
+table.manageSnapshots().createBranch("test_branch", snapshotId).commit();
+
+// create a second snapshot, read the table at the snapshot
+List<Object[]> expected = sql("SELECT * FROM %s", tableName);
+sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName);
+List<Object[]> actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_branch'", tableName);
+assertEquals("Snapshot at specific branch reference name", expected, actual1);
+
+// read the table at the snapshot
+// HIVE time travel syntax
+List<Object[]> actual2 =
+sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_branch'", tableName);
+assertEquals("Snapshot at specific branch reference name", expected, actual2);
+
+// read the table using DataFrameReader option: branch
+Dataset<Row> df =
+spark
+.read()
+.format("iceberg")
+.option(SparkReadOptions.BRANCH, "test_branch")
+.load(tableName);
+List<Object[]> fromDF = rowsToJava(df.collectAsList());
+assertEquals("Snapshot at specific branch reference name", expected, fromDF);
+  }
+
+  @Test
+  public void testUnknownReferenceAsOf() {
+AssertHelpers.assertThrows(
+"Cannot find matching snapshot ID or reference name for version",

Review Comment:
   sure, I can update to use that






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name

2023-01-13 Thread GitBox


jackye1995 commented on code in PR #6575:
URL: https://github.com/apache/iceberg/pull/6575#discussion_r1069798452


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##
@@ -159,7 +160,15 @@ public Table loadTable(Identifier ident, String version) throws NoSuchTableExcep
  sparkTable.snapshotId() == null,
  "Cannot do time-travel based on both table identifier and AS OF");
 
-  return sparkTable.copyWithSnapshotId(Long.parseLong(version));
+  try {
+return sparkTable.copyWithSnapshotId(Long.parseLong(version));
+  } catch (NumberFormatException e) {
+SnapshotRef ref = sparkTable.table().refs().get(version);
+ValidationException.check(
+ref != null,
+"Cannot find matching snapshot ID or reference name for version " + version);
+return sparkTable.copyWithSnapshotId(ref.snapshotId());

Review Comment:
   I think one thing that might be useful is time travel within a branch, something like `FOR SYSTEM_VERSION AS OF branchName@123456789012`. But that feels very hacky; I'd rather have syntax like we have been suggesting, e.g. `SELECT * FROM table BRANCH branch FOR SYSTEM_TIME AS OF xxx`. So I am leaving that part out of the implementation for now. At least I think most people can agree that a tag/branch head can be viewed as a version to travel to.
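   
   For reference, the reference-name forms this PR does support, mirroring the tests above:
   
   ```sql
   -- a tag or branch name is now accepted where a snapshot ID used to be required
   SELECT * FROM db.table VERSION AS OF 'test_tag';
   SELECT * FROM db.table FOR SYSTEM_VERSION AS OF 'test_branch';
   ```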






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-13 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1069808528


##
python/tests/expressions/test_evaluator.py:
##
@@ -52,112 +54,126 @@
 )
 
 
+def _record_simple(id: int, data: Optional[str]) -> Record:  # pylint: disable=redefined-builtin
+r = Record(struct=SIMPLE_SCHEMA.as_struct())
+r[0] = id
+r[1] = data
+return r
+
+
+def _record_float(id: float, f: float) -> Record:  # pylint: disable=redefined-builtin
+r = Record(struct=FLOAT_SCHEMA.as_struct())
+r[0] = id
+r[1] = f
+return r
+
+
 def test_true() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, AlwaysTrue(), case_sensitive=True)
-assert evaluate(Record(1, "a"))
+assert evaluate(_record_simple(1, "a"))
 
 
 def test_false() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, AlwaysFalse(), case_sensitive=True)
-assert not evaluate(Record(1, "a"))
+assert not evaluate(_record_simple(1, "a"))
 
 
 def test_less_than() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, LessThan("id", 3), case_sensitive=True)
-assert evaluate(Record(2, "a"))
-assert not evaluate(Record(3, "a"))
+assert evaluate(_record_simple(2, "a"))
+assert not evaluate(_record_simple(3, "a"))
 
 
 def test_less_than_or_equal() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, LessThanOrEqual("id", 3), case_sensitive=True)
-assert evaluate(Record(1, "a"))
-assert evaluate(Record(3, "a"))
-assert not evaluate(Record(4, "a"))
+assert evaluate(_record_simple(1, "a"))
+assert evaluate(_record_simple(3, "a"))
+assert not evaluate(_record_simple(4, "a"))
 
 
 def test_greater_than() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, GreaterThan("id", 3), case_sensitive=True)
-assert not evaluate(Record(1, "a"))
-assert not evaluate(Record(3, "a"))
-assert evaluate(Record(4, "a"))
+assert not evaluate(_record_simple(1, "a"))
+assert not evaluate(_record_simple(3, "a"))
+assert evaluate(_record_simple(4, "a"))
 
 
 def test_greater_than_or_equal() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, GreaterThanOrEqual("id", 3), case_sensitive=True)
-assert not evaluate(Record(2, "a"))
-assert evaluate(Record(3, "a"))
-assert evaluate(Record(4, "a"))
+assert not evaluate(_record_simple(2, "a"))
+assert evaluate(_record_simple(3, "a"))
+assert evaluate(_record_simple(4, "a"))
 
 
 def test_equal_to() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, EqualTo("id", 3), case_sensitive=True)
-assert not evaluate(Record(2, "a"))
-assert evaluate(Record(3, "a"))
-assert not evaluate(Record(4, "a"))
+assert not evaluate(_record_simple(2, "a"))
+assert evaluate(_record_simple(3, "a"))
+assert not evaluate(_record_simple(4, "a"))
 
 
 def test_not_equal_to() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, NotEqualTo("id", 3), case_sensitive=True)
-assert evaluate(Record(2, "a"))
-assert not evaluate(Record(3, "a"))
-assert evaluate(Record(4, "a"))
+assert evaluate(_record_simple(2, "a"))
+assert not evaluate(_record_simple(3, "a"))
+assert evaluate(_record_simple(4, "a"))
 
 
 def test_in() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, In("id", [1, 2, 3]), case_sensitive=True)
-assert evaluate(Record(2, "a"))
-assert evaluate(Record(3, "a"))
-assert not evaluate(Record(4, "a"))
+assert evaluate(_record_simple(2, "a"))
+assert evaluate(_record_simple(3, "a"))
+assert not evaluate(_record_simple(4, "a"))
 
 
 def test_not_in() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, NotIn("id", [1, 2, 3]), case_sensitive=True)
-assert not evaluate(Record(2, "a"))
-assert not evaluate(Record(3, "a"))
-assert evaluate(Record(4, "a"))
+assert not evaluate(_record_simple(2, "a"))
+assert not evaluate(_record_simple(3, "a"))
+assert evaluate(_record_simple(4, "a"))
 
 
 def test_is_null() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, IsNull("data"), case_sensitive=True)
-assert not evaluate(Record(2, "a"))
-assert evaluate(Record(3, None))
+assert not evaluate(_record_simple(2, "a"))
+assert evaluate(_record_simple(3, None))
 
 
 def test_not_null() -> None:
 evaluate = expression_evaluator(SIMPLE_SCHEMA, NotNull("data"), case_sensitive=True)
-assert evaluate(Record(2, "a"))
-assert not evaluate(Record(3, None))
+assert evaluate(_record_simple(2, "a"))
+assert not evaluate(_record_simple(3, None))
 
 
 def test_is_nan() -> None:
 evaluate = expression_evaluator(FLOAT_SCHEMA, IsNaN("f"), case_sensitive=True)
-assert not evaluate(Record(2, 0.0))
-assert not evaluate(Record(3, float("infinity")))
-assert evaluate(Record(4, float("nan")))
+assert not evaluate(_record_float(2, f=0.0))
+assert not evaluate(_record_float(3, f=float("in

[GitHub] [iceberg] jackye1995 commented on pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure

2023-01-13 Thread GitBox


jackye1995 commented on PR #6576:
URL: https://github.com/apache/iceberg/pull/6576#issuecomment-1382243877

   @xiaoxuandev can you take a look? I believe this case should have been covered by a unit test; we need to take a deeper look into it.





[GitHub] [iceberg] huaxingao opened a new pull request, #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao opened a new pull request, #6582:
URL: https://github.com/apache/iceberg/pull/6582

   Add a Spark procedure to collect NDV (number of distinct values), which will be used for cost-based optimization (CBO).





[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1069901978


##
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/";>Apache DataSketches library
*/
   public static final String APACHE_DATASKETCHES_THETA_V1 = 
"apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   Spark doesn't use Apache DataSketches to collect approximate NDV, so I am adding a new blob type. Hope this is OK.
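   
   For context, a sketch of writing the proposed blob type through the existing Puffin API (the field ID, snapshot/sequence numbers, and the UTF-8 encoding of the NDV value are illustrative assumptions):
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.charset.StandardCharsets;
   import org.apache.iceberg.puffin.Blob;
   import org.apache.iceberg.puffin.StandardBlobTypes;
   import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
   
   class NdvBlobSketch {
     // Builds a Puffin blob carrying the NDV for one source column.
     static Blob ndvBlob(int fieldId, long snapshotId, long sequenceNumber, long ndv) {
       return new Blob(
           StandardBlobTypes.NDV_BLOB, // the new type proposed in this PR
           ImmutableList.of(fieldId),
           snapshotId,
           sequenceNumber,
           ByteBuffer.wrap(Long.toString(ndv).getBytes(StandardCharsets.UTF_8)));
     }
   }
   ```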






[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1069914737


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import static java.lang.String.format;
+import static java.util.UUID.randomUUID;
+import static org.apache.hadoop.shaded.com.google.common.collect.ImmutableList.toImmutableList;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets the approximate NDV (number of distinct values) for the requested columns
+ * and sets this to the table's StatisticsFile.
+ */

Review Comment:
   I am debating whether I should collect only NDV, or also collect everything else (max, min, num_nulls, etc.) as ANALYZE TABLE does. I will just collect NDV for now.






[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1069918267


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import static java.lang.String.format;
+import static java.util.UUID.randomUUID;
+import static org.apache.hadoop.shaded.com.google.common.collect.ImmutableList.toImmutableList;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets the approximate NDV (number of distinct values) for the requested columns
+ * and sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+  fieldId[i] = schema.fieldIndex(colName);
+}
+
+query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+Data

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


jackye1995 commented on code in PR #6428:
URL: https://github.com/apache/iceberg/pull/6428#discussion_r1069937682


##
snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableMetadata.java:
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.snowflake;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.iceberg.relocated.com.google.common.base.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+class SnowflakeTableMetadata {
+  public static final Pattern SNOWFLAKE_AZURE_PATTERN =
+  Pattern.compile("azure://([^/]+)/([^/]+)/(.*)");
+
+  private final String snowflakeMetadataLocation;
+  private final String icebergMetadataLocation;
+  private final String status;
+
+  // Note: Since not all sources will necessarily come from a raw JSON representation, this raw
+  // JSON should only be considered a convenient debugging field. Equality of two
+  // SnowflakeTableMetadata instances should not depend on equality of this field.
+  private final String rawJsonVal;
+
+  SnowflakeTableMetadata(
+  String snowflakeMetadataLocation,
+  String icebergMetadataLocation,
+  String status,
+  String rawJsonVal) {
+this.snowflakeMetadataLocation = snowflakeMetadataLocation;
+this.icebergMetadataLocation = icebergMetadataLocation;
+this.status = status;
+this.rawJsonVal = rawJsonVal;
+  }
+
+  /** Storage location of table metadata in Snowflake's path syntax. */
+  public String snowflakeMetadataLocation() {
+return snowflakeMetadataLocation;
+  }
+
+  /** Storage location of table metadata in Iceberg's path syntax. */
+  public String icebergMetadataLocation() {
+return icebergMetadataLocation;
+  }
+
+  public String getStatus() {
+return status;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) {
+  return true;
+} else if (!(o instanceof SnowflakeTableMetadata)) {
+  return false;
+}
+
+// Only consider parsed fields, not the raw JSON that may or may not be the original source of
+// this instance.
+SnowflakeTableMetadata that = (SnowflakeTableMetadata) o;
+return Objects.equal(this.snowflakeMetadataLocation, that.snowflakeMetadataLocation)
+&& Objects.equal(this.icebergMetadataLocation, that.icebergMetadataLocation)
+&& Objects.equal(this.status, that.status);
+  }
+
+  @Override
+  public int hashCode() {
+return Objects.hashCode(snowflakeMetadataLocation, icebergMetadataLocation, status);
+  }
+
+  @Override
+  public String toString() {
+return String.format(
+"snowflakeMetadataLocation: '%s', icebergMetadataLocation: '%s', 
status: '%s'",
+snowflakeMetadataLocation, icebergMetadataLocation, status);
+  }
+
+  public String toDebugString() {
+return String.format("%s, rawJsonVal: %s", toString(), rawJsonVal);
+  }
+
+  /**
+   * Translates from Snowflake's path syntax to Iceberg's path syntax for paths matching known
+   * non-compatible Snowflake paths. Throws IllegalArgumentException if the prefix of the
+   * snowflakeLocation is a known non-compatible path syntax but fails to match the expected path
+   * components for a successful translation.
+   */
+  public static String snowflakeLocationToIcebergLocation(String snowflakeLocation) {

Review Comment:
   Thanks for the clarification. I approved the changes; very excited to see this from Snowflake.
   
   >  I've been meaning to open a discussion to see if anyone has thought about maybe adding some ease-of-use hooks for last-mile automatic basic path translations right before they go to FileIO resolution.
   
   For HadoopFileIO, I think this is not really needed, because typically we see that our users have already configured HDFS settings to map schemes to whatever file system implementations they would like to use.
   
   I think ResolvingFileIO already does this kind of translation to some extent; maybe we can extend its functionality on that front.




[GitHub] [iceberg] JonasJ-ap opened a new pull request, #6179: AWS: Re-tag files when renaming tables in GlueCatalog

2023-01-13 Thread GitBox


JonasJ-ap opened a new pull request, #6179:
URL: https://github.com/apache/iceberg/pull/6179

   Follows PR #4402.
   As mentioned in https://github.com/apache/iceberg/pull/4402#issuecomment-1261096282:
   In `GlueCatalog`, if `s3.write.table-name-tag-enabled` and `s3.write.namespace-name-tag-enabled` are enabled, all the files related to the table will be tagged with the table name and the namespace name. We need to update these tags when we rename the table.
   
   This PR adds S3 tag updates for the `renameTable` operation in `GlueCatalog`.
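   
   For context, a sketch (hypothetical warehouse path; property names as quoted above) of a `GlueCatalog` configured with the tagging features that `renameTable` must now keep consistent:
   
   ```java
   import java.util.Map;
   import org.apache.iceberg.aws.glue.GlueCatalog;
   import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
   
   class GlueTaggingSketch {
     static GlueCatalog taggingCatalog() {
       Map<String, String> properties =
           ImmutableMap.of(
               "warehouse", "s3://my-bucket/warehouse", // hypothetical bucket
               "s3.write.table-name-tag-enabled", "true",
               "s3.write.namespace-name-tag-enabled", "true");
       GlueCatalog catalog = new GlueCatalog();
       catalog.initialize("glue", properties); // files written for a table are then tagged
       return catalog;
     }
   }
   ```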





[GitHub] [iceberg] jackye1995 commented on pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

2023-01-13 Thread GitBox


jackye1995 commented on PR #6179:
URL: https://github.com/apache/iceberg/pull/6179#issuecomment-1382318184

   I'd like to discuss this a bit more, since we do have some actual customer use cases for this. Overall, the S3-tagging-related features in Iceberg integrate very well with S3 lifecycle policies and bucket permissions; RENAME seems to be the only piece that is missing for the whole end-to-end flow to work.
   
   > I think that if a rename requires anything but a metadata operation, then the Iceberg catalog should not allow it and should throw an exception that rename is not supported.
   
   Totally agree. The reason I was relatively not against this approach was that RENAME is not a frequent operation for most users, and they are willing to wait longer for the full rename to complete if it can update related tags or other metadata information in storage.
   
   > Users aren't going to know that this needs to perform a potentially huge number of S3 operations. 
   
   Correct. Maybe one way to resolve this is to throw an exception by default, so that users have to turn on the feature explicitly and are thereby aware of the implications.
   
   This logic will probably exist outside the Iceberg repository if it is not contributed in. @rdblue, please let me know if there is any intermediate ground that could be taken for this use case.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


RussellSpitzer commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070017872


##
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/";>Apache DataSketches library
*/
   public static final String APACHE_DATASKETCHES_THETA_V1 = 
"apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   @findepi What are you using for NDV stats here? I figure we should have a common blob type.






[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


RussellSpitzer commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070020289


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets the approximate NDV (number of distinct values) for the requested columns and
+ * sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";

Review Comment:
   Since we are technically not using distinct here, maybe we should be calling the procedure "analyze"?




[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


RussellSpitzer commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070023914


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that computes the approximate NDV (number of distinct values) for the
+ * requested columns and writes the result to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = 
LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", 
DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, 
Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+  fieldId[i] = schema.fieldIndex(colName);
+}
+
+query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+Dataset<Row> df = spark().sql(query);
+
+for (int i = 0; i < columnSizes; i++) {
+  ndvs[i] = df.head().getLong(i);
+}
+
+TableOperations operatio

[GitHub] [iceberg] RussellSpitzer commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

2023-01-13 Thread GitBox


RussellSpitzer commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1382374860

   I'm still a little worried about saving this information: what does knowing 
the sort order mean for a file? Are we guaranteeing that the file is locally 
sorted by that order, or globally sorted with the files added in that snapshot? 
I'm cautious about doing anything here until we really define what it means.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] pvary opened a new pull request, #6583: Flink: Refactor sink tests to use HadoopCatalogResource

2023-01-13 Thread GitBox


pvary opened a new pull request, #6583:
URL: https://github.com/apache/iceberg/pull/6583

   Refactor Flink Sink tests to use the HadoopCatalogResource.
   This is groundwork for adding encryption tests for Flink Sources and Sinks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] pvary commented on pull request #6583: Flink: Refactor sink tests to use HadoopCatalogResource

2023-01-13 Thread GitBox


pvary commented on PR #6583:
URL: https://github.com/apache/iceberg/pull/6583#issuecomment-1382415642

   CC: @hililiwei, @ggershinsky  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6581: Spark 3.3: Add RemoveDanglingDeletes action

2023-01-13 Thread GitBox


szehon-ho commented on code in PR #6581:
URL: https://github.com/apache/iceberg/pull/6581#discussion_r1070090624


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RemoveDanglingDeletesSparkAction.java:
##
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import static org.apache.iceberg.MetadataTableType.ENTRIES;
+import static org.apache.spark.sql.functions.min;
+
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.DeleteFiles;
+import org.apache.iceberg.FileMetadata;
+import org.apache.iceberg.PartitionData;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Partitioning;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.actions.RemoveDanglingDeleteFiles;
+import org.apache.iceberg.actions.RemoveDanglingDeleteFilesActionResult;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.JobGroupInfo;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.api.java.function.MapFunction;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema;
+
+/**
+ * An action that removes dangling delete files from the current snapshot. A delete file is
+ * dangling if its deletes no longer apply to any data file.
+ *
+ * <p>The following dangling delete files are removed:
+ *
+ * <ul>
+ *   <li>Position delete files with a sequence number less than that of any data file in the
+ *       same partition
+ *   <li>Equality delete files with a sequence number less than or equal to that of any data
+ *       file in the same partition
+ * </ul>
+ */
+public class RemoveDanglingDeletesSparkAction
+extends BaseSnapshotUpdateSparkAction<RemoveDanglingDeletesSparkAction>
+implements RemoveDanglingDeleteFiles {
+
+  private final Table table;
+
+  protected RemoveDanglingDeletesSparkAction(SparkSession spark, Table table) {
+super(spark);
+this.table = table;
+  }
+
+  @Override
+  protected RemoveDanglingDeletesSparkAction self() {
+return this;
+  }
+
+  @Override
+  public Result execute() {
+if (table.specs().size() == 1 && table.spec().isUnpartitioned()) {
+  // ManifestFilterManager already performs this table-wide on each commit
+  return RemoveDanglingDeleteFilesActionResult.empty();
+}
+
+String desc = String.format("Remove dangling delete files for %s", 
table.name());
+JobGroupInfo info = newJobGroupInfo("REWRITE-MANIFESTS", desc);
+return withJobGroupInfo(info, this::doExecute);
+  }
+
+  private Result doExecute() {
+Dataset<Row> entries =
+loadMetadataTable(table, ENTRIES)
+.filter("status < 2") // live entries
+.selectExpr(
+"data_file.partition as partition",
+"data_file.spec_id as spec_id",
+"data_file.file_path as file_path",
+"data_file.content as content",
+"data_file.file_size_in_bytes as file_size_in_bytes",
+"data_file.record_count as record_count",
+"sequence_number");
+
+DeleteFiles deleteFiles = table.newDelete();
+List<DeleteFile> toRemove = withReusableDS(entries, this::danglingDeletes);
+toRemove.forEach(deleteFiles::deleteFile);
+deleteFiles.commit();
+
+return new RemoveDanglingDeleteFilesActionResult(toRemove);
+  }
+
+  /**
+   * Calculate dangling delete files
+   *
+   * <ol>
+   *   <li>Group all files by partition and calculate the minimum data file sequence number in
+   *       each.
+   *   <li>For each partition, check if any position delete files have a sequence number less
+   *       than the partition's min_data_sequence_number.
+   *   <li>For each partition, check if any equality delete files have a sequence number less
+   *       than or equal to the partition's min_data_sequence_number.
+   * </ol>
+   *
+   * @param entries dataset of file entries, marked by content (0 for data, 1 
for posDeletes, 2 for
+   * eqDeletes)
+   * @return list of dangling delete f
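
   A minimal sketch of the check this javadoc describes, expressed in SQL over the `entries` dataset selected in `doExecute` above (illustrative only; the real implementation must also handle unpartitioned specs and null partition values):
   
   ```java
   // Illustrative sketch of the dangling-delete detection described in the javadoc;
   // assumes the "entries" Dataset<Row> from doExecute registered as a temp view.
   entries.createOrReplaceTempView("entries");
   Dataset<Row> dangling =
       spark()
           .sql(
               "WITH min_data AS ("
                   + " SELECT `partition`, spec_id, MIN(sequence_number) AS min_data_seq"
                   + " FROM entries WHERE content = 0 GROUP BY `partition`, spec_id)"
                   + " SELECT e.file_path, e.content FROM entries e"
                   + " JOIN min_data m ON e.`partition` <=> m.`partition` AND e.spec_id = m.spec_id"
                   + " WHERE (e.content = 1 AND e.sequence_number < m.min_data_seq)"
                   + "    OR (e.content = 2 AND e.sequence_number <= m.min_data_seq)");
   ```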

[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

2023-01-13 Thread GitBox


huaxingao commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1382430906

   Here is the [PR](https://github.com/apache/iceberg/pull/6582) for 
implementing a Spark stored procedure to collect NDV.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070099281


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that computes the approximate NDV (number of distinct values) for the
+ * requested columns and writes the result to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = 
LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", 
DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, 
Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";

Review Comment:
   I will change the procedure name from `DistinctCountProcedure` to 
`AnalyzeTableProcedure`
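
   Once renamed, the invocation might look like this (the registration name `analyze_table` is hypothetical and not final in this PR; `table` and `columns` match the PARAMETERS array above):
   
   ```java
   // Hypothetical call; the procedure registration name is not final in this PR.
   spark.sql(
       "CALL spark_catalog.system.analyze_table("
           + "table => 'db.sample', columns => array('id', 'data'))");
   ```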



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-m

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070101248


##
core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java:
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+class ViewHistoryEntryParser {
+
+  private ViewHistoryEntryParser() {}
+
+  static final String VERSION_ID = "version-id";
+  static final String TIMESTAMP_MS = "timestamp-ms";
+
+  static String toJson(ViewHistoryEntry entry) {
+return JsonUtil.generate(gen -> toJson(entry, gen), false);
+  }
+
+  static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws 
IOException {
+generator.writeStartObject();
+generator.writeNumberField(TIMESTAMP_MS, entry.timestampMillis());
+generator.writeNumberField(VERSION_ID, entry.versionId());
+generator.writeEndObject();
+  }
+
+  static ViewHistoryEntry fromJson(String json) {
+Preconditions.checkArgument(
+json != null && !json.isEmpty(),
+"Cannot parse view history entry from invalid JSON: %s",
+json);
+return JsonUtil.parse(json, ViewHistoryEntryParser::fromJson);
+  }
+
+  static ViewHistoryEntry fromJson(JsonNode node) {
+return new BaseViewHistoryEntry(

Review Comment:
   Separated the precondition checks.
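
   For context, a round-trip sketch of the JSON this parser produces and consumes (values are illustrative):
   
   ```java
   // Illustrative values; field names follow TIMESTAMP_MS and VERSION_ID above.
   ViewHistoryEntry entry =
       ViewHistoryEntryParser.fromJson("{\"timestamp-ms\": 1673650000000, \"version-id\": 1}");
   String json = ViewHistoryEntryParser.toJson(entry);
   // json -> {"timestamp-ms":1673650000000,"version-id":1}
   ```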



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070101425


##
core/src/test/java/org/apache/iceberg/view/TestViewHistoryEntryParser.java:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestViewHistoryEntryParser {
+
+  @Test

Review Comment:
   Good point, I've added more tests around null, empty, and missing fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070101669


##
core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java:
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+class ViewHistoryEntryParser {
+
+  private ViewHistoryEntryParser() {}
+
+  static final String VERSION_ID = "version-id";
+  static final String TIMESTAMP_MS = "timestamp-ms";
+
+  static String toJson(ViewHistoryEntry entry) {
+return JsonUtil.generate(gen -> toJson(entry, gen), false);
+  }
+
+  static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws 
IOException {
+generator.writeStartObject();

Review Comment:
   Sure, updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6559: Core: View core parser implementations

2023-01-13 Thread GitBox


amogh-jahagirdar commented on PR #6559:
URL: https://github.com/apache/iceberg/pull/6559#issuecomment-1382437507

   Thanks @nastra, I'll be taking these suggestions in all the split PRs I'm 
raising. Agreed, more tests on nullability/missing fields are needed. Now that 
we use the Immutables dependency in the metrics implementation, we have a good 
precedent to use it here as well, which will simplify a lot of the boilerplate 
code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar closed pull request #6559: Core: View core parser implementations

2023-01-13 Thread GitBox


amogh-jahagirdar closed pull request #6559: Core: View core parser 
implementations
URL: https://github.com/apache/iceberg/pull/6559


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] rdblue commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


rdblue commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070122674


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   I think we can drop `Base` since this isn't a base class. We use that when 
we expect something to extend the class. But this is an interface that will 
result in an `ImmutableViewHistoryEntry`.
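
   For illustration, usage of the generated class might look like this (sketch; `ImmutableViewHistoryEntry` and its builder are what the Immutables processor derives by default):
   
   ```java
   // Hypothetical usage; ImmutableViewHistoryEntry is generated by the Immutables processor.
   ViewHistoryEntry entry =
       ImmutableViewHistoryEntry.builder()
           .timestampMillis(System.currentTimeMillis())
           .versionId(1)
           .build();
   ```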



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] rdblue commented on pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


rdblue commented on PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#issuecomment-1382456468

   Looks good other than the name of the history entry interface. Thanks, 
@amogh-jahagirdar!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] flyrain commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


flyrain commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070140194


##
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/";>Apache DataSketches library
*/
   public static final String APACHE_DATASKETCHES_THETA_V1 = 
"apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   Ideally we should use one blob type for NDV, although Spark doesn't have the 
sketch data. I'm also curious how sketch data is useful for a table-level 
metric. It is absolutely useful at the file and partition level, since we can 
merge the sketches later.
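
   To make the merge point concrete, a hedged sketch with the Apache DataSketches theta API (method names per recent datasketches-java releases; minor version differences are possible):
   
   ```java
   import org.apache.datasketches.theta.SetOperation;
   import org.apache.datasketches.theta.Sketch;
   import org.apache.datasketches.theta.Union;
   import org.apache.datasketches.theta.UpdateSketch;
   
   public class ThetaMergeSketch {
     public static void main(String[] args) {
       // Two per-file sketches over overlapping value ranges.
       UpdateSketch fileA = UpdateSketch.builder().build();
       UpdateSketch fileB = UpdateSketch.builder().build();
       for (long i = 0; i < 1000; i++) { fileA.update(i); }
       for (long i = 500; i < 1500; i++) { fileB.update(i); }
   
       // Merging collapses duplicates, so the union estimates the true NDV (~1500),
       // which is why per-file sketches can roll up to partition or table level.
       Union union = SetOperation.builder().buildUnion();
       union.union(fileA);
       union.union(fileB);
       Sketch tableLevel = union.getResult();
       System.out.println(tableLevel.getEstimate());
     }
   }
   ```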



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] stevenzwu opened a new pull request, #6584: Flink: support reading as Avro GenericRecord for FLIP-27 IcebergSource

2023-01-13 Thread GitBox


stevenzwu opened a new pull request, #6584:
URL: https://github.com/apache/iceberg/pull/6584

   cc @hililiwei 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070149483


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Sure, let me see about a better name; the underlying interface is already 
ViewHistoryEntry. I think technically we could just add the immutable 
annotation to ViewHistoryEntry, but maybe it's better to have an explicit core 
implementation that uses the annotation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070149483


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Sure, let me see about a better name; the underlying interface is already 
ViewHistoryEntry. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


szehon-ho commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070150005


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that computes the approximate NDV (number of distinct values) for the
+ * requested columns and writes the result to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = 
LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", 
DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, 
Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+  fieldId[i] = schema.fieldIndex(colName);
+}
+
+query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+Dataset<Row> df = spark().sql(query);
+
+for (int i = 0; i < columnSizes; i++) {
+  ndvs[i] = df.head().getLong(i);
+}
+
+TableOperations operations = 
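
   For context, persisting such a blob generally follows the Puffin API imported in this diff; a hedged sketch (simplified; outputFile, snapshotId, sequenceNumber, fieldId, and ndvBuffer are assumed inputs, not the PR's exact code):
   
   ```java
   // Hedged sketch of writing a stats blob with the Puffin writer imported above.
   PuffinWriter writer = Puffin.write(outputFile).build();
   try {
     writer.add(
         new Blob(
             StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1,
             ImmutableList.of(fieldId),
             snapshotId,
             sequenceNumber,
             ndvBuffer));
   } finally {
     writer.close(); // finishes the file and writes the Puffin footer
   }
   ```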

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070149483


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Sure, let me see about a better name; the underlying interface is already 
ViewHistoryEntry. 
   
   ViewLogEntry comes to mind; it is analogous to SnapshotLogEntry 
implementing HistoryEntry for tables. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] rdblue commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


rdblue commented on code in PR #6428:
URL: https://github.com/apache/iceberg/pull/6428#discussion_r1070158086


##
snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java:
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.snowflake;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.CatalogProperties;
+import org.apache.iceberg.CatalogUtil;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.SupportsNamespaces;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.NoSuchNamespaceException;
+import org.apache.iceberg.hadoop.Configurable;
+import org.apache.iceberg.io.CloseableGroup;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.jdbc.JdbcClientPool;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SnowflakeCatalog extends BaseMetastoreCatalog
+implements Closeable, SupportsNamespaces, Configurable {
+  private static final String DEFAULT_CATALOG_NAME = "snowflake_catalog";
+  private static final String DEFAULT_FILE_IO_IMPL = 
"org.apache.iceberg.io.ResolvingFileIO";
+
+  static class FileIOFactory {
+public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
+  return CatalogUtil.loadFileIO(impl, properties, hadoopConf);

Review Comment:
   This factory seems odd since it has no state that is passed through to the 
`loadFileIO` method. Couldn't this just call `loadFileIO` directly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6573:
URL: https://github.com/apache/iceberg/pull/6573#discussion_r1070158484


##
docs/spark-queries.md:
##
@@ -126,6 +126,8 @@ To select a specific table snapshot or the snapshot at some 
time in the DataFram
 
 * `snapshot-id` selects a specific table snapshot
 * `as-of-timestamp` selects the current snapshot at a timestamp, in 
milliseconds
+* `branch` selects the head snapshot of the specified branch. Note that 
currently branch cannot be combined with as-of-timestamp.
+* `tag` selects the snapshot associated with the specified tag

Review Comment:
   Definitely agree on having a SQL example once #6575 gets merged. For 
combining as-of-timestamp with tag, I felt that was apparent, since a tag can 
only map to a single snapshot, which conflicts with passing in a timestamp, 
whereas branch + time travel is a different case.
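
   For reference, a minimal read sketch of the two options being documented (table and ref names are illustrative):
   
   ```java
   // "branch" and "tag" are the read options described in the doc change above.
   Dataset<Row> fromBranch =
       spark.read().format("iceberg").option("branch", "audit-branch").load("db.table");
   
   Dataset<Row> fromTag =
       spark.read().format("iceberg").option("tag", "end-of-2022").load("db.table");
   ```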



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070149483


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Sure, let me see about a better name; the underlying interface is already 
ViewHistoryEntry. I would add the annotation to the underlying interface 
directly, but then that moves the implementation's classpath into API, which 
goes against the existing pattern? 
   
   ViewLogEntry comes to mind; it is analogous to SnapshotLogEntry 
implementing HistoryEntry for tables. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070159167


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Maybe a dumb question: why don't we just add the immutable annotation on the 
interface in that case?
   
   ```
   @Value.Immutable
   public interface ViewLogEntry extends ViewHistoryEntry {
 long timestampMillis();
   
 int versionId();
   }
   ```
   
   ```
   public interface ViewHistoryEntry {
 /** Return the timestamp in milliseconds of the change */
 long timestampMillis();
   
 /** Return ID of the new current version */
 int versionId();
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070159167


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Maybe a dumb question: why don't we just add the immutable annotation on the 
interface in that case? With this, we have:

   ```
   @Value.Immutable
   public interface ViewLogEntry extends ViewHistoryEntry {
 long timestampMillis();
   
 int versionId();
   }
   ```
   
   ```
   public interface ViewHistoryEntry {
 /** Return the timestamp in milliseconds of the change */
 long timestampMillis();
   
 /** Return ID of the new current version */
 int versionId();
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070161504


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import org.immutables.value.Value;
+
+@Value.Immutable
+public interface BaseViewHistoryEntry extends ViewHistoryEntry {

Review Comment:
   Not a dumb question. Nvm on the classpath issue I mentioned, I confused 
myself. We can just mark the base interface as immutable and eliminate the 
duplicate interface.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] dennishuo commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


dennishuo commented on code in PR #6428:
URL: https://github.com/apache/iceberg/pull/6428#discussion_r1070164244


##
snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java:
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.snowflake;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.CatalogProperties;
+import org.apache.iceberg.CatalogUtil;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.SupportsNamespaces;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.NoSuchNamespaceException;
+import org.apache.iceberg.hadoop.Configurable;
+import org.apache.iceberg.io.CloseableGroup;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.jdbc.JdbcClientPool;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SnowflakeCatalog extends BaseMetastoreCatalog
+implements Closeable, SupportsNamespaces, Configurable {
+  private static final String DEFAULT_CATALOG_NAME = "snowflake_catalog";
+  private static final String DEFAULT_FILE_IO_IMPL = 
"org.apache.iceberg.io.ResolvingFileIO";
+
+  static class FileIOFactory {
+public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
+  return CatalogUtil.loadFileIO(impl, properties, hadoopConf);

Review Comment:
   This was just extracted to preserve the flow of the unit test setup: it lets 
the tests return the pre-configured InMemoryFileIO fake instance rather than 
having dynamic classloading construct an empty one. The alternative would've 
been to introduce some static global state in InMemoryFileIO, but that seems 
prone to causing cross-test-case problems.
   
   I could add a comment or annotation here to explain its existence if that'd 
make it cleaner.
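   
   As an illustration, here's a hedged sketch of the test override this enables (the `InMemoryFileIO` fake and the exact wiring are assumptions, not the PR's actual test code):
   
   ```java
   // Hypothetical test setup: hand the catalog a factory that returns a
   // pre-configured fake, instead of letting CatalogUtil.loadFileIO build
   // an empty FileIO via dynamic classloading.
   FileIO fakeIO = new InMemoryFileIO(); // pre-populated by the test
   FileIOFactory testFactory =
       new FileIOFactory() {
         @Override
         public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
           return fakeIO;
         }
       };
   ```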



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] dennishuo commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


dennishuo commented on code in PR #6428:
URL: https://github.com/apache/iceberg/pull/6428#discussion_r1070166470


##
snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java:
##
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.snowflake;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.CatalogProperties;
+import org.apache.iceberg.CatalogUtil;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.SupportsNamespaces;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.NoSuchNamespaceException;
+import org.apache.iceberg.hadoop.Configurable;
+import org.apache.iceberg.io.CloseableGroup;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.jdbc.JdbcClientPool;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class SnowflakeCatalog extends BaseMetastoreCatalog
+implements Closeable, SupportsNamespaces, Configurable {
+  private static final String DEFAULT_CATALOG_NAME = "snowflake_catalog";
+  private static final String DEFAULT_FILE_IO_IMPL = 
"org.apache.iceberg.io.ResolvingFileIO";
+
+  static class FileIOFactory {
+public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
+  return CatalogUtil.loadFileIO(impl, properties, hadoopConf);

Review Comment:
   Added code comment to clarify



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] williamhyun opened a new pull request, #6585: Update ORC to 1.8.2

2023-01-13 Thread GitBox


williamhyun opened a new pull request, #6585:
URL: https://github.com/apache/iceberg/pull/6585

   Apache ORC 1.8.2 is the latest version of ORC; it brings the following 
changes and bug fixes, including an SBOM.
   - https://github.com/apache/orc/releases/tag/v1.8.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] github-actions[bot] commented on issue #5183: Allow to configure Avro block size

2023-01-13 Thread GitBox


github-actions[bot] commented on issue #5183:
URL: https://github.com/apache/iceberg/issues/5183#issuecomment-1382598801

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] github-actions[bot] commented on issue #4607: [Docs] Create an item list for re-organizing docs to the proposed layout

2023-01-13 Thread GitBox


github-actions[bot] commented on issue #4607:
URL: https://github.com/apache/iceberg/issues/4607#issuecomment-1382598842

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] github-actions[bot] closed issue #5183: Allow to configure Avro block size

2023-01-13 Thread GitBox


github-actions[bot] closed issue #5183: Allow to configure Avro block size 
URL: https://github.com/apache/iceberg/issues/5183


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6574: Python: Raise exception on deletes

2023-01-13 Thread GitBox


dramaticlly commented on code in PR #6574:
URL: https://github.com/apache/iceberg/pull/6574#discussion_r1070176238


##
python/pyiceberg/table/__init__.py:
##
@@ -341,7 +346,18 @@ def plan_files(self) -> Iterator[FileScanTask]:
 all_files = files(io.new_input(manifest.manifest_path))
 matching_partition_files = filter(partition_filter, all_files)
 
-yield from (FileScanTask(file) for file in 
matching_partition_files)
+def _check_content(file: ManifestFile) -> ManifestFile:
+try:
+if file.content == ManifestContent.DELETES:
+raise ValueError("PyIceberg does not support deletes: 
https://github.com/apache/iceberg/issues/6568")

Review Comment:
   Shall we have a test to cover this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070181796


##
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/";>Apache DataSketches library
*/
   public static final String APACHE_DATASKETCHES_THETA_V1 = 
"apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   @szehon-ho Right, we should have a better name for this. I am not sure we 
can have a common blob type here. I will wait for @findepi's input before 
changing this one. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070181906


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import static java.lang.String.format;
+import static java.util.UUID.randomUUID;
+import static 
org.apache.hadoop.shaded.com.google.common.collect.ImmutableList.toImmutableList;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the 
requested columns
+ * and sets this to the table's StatisticsFile.
+ */

Review Comment:
   We have file-level metadata for max, min, num_nulls, etc. That's why I was 
hesitant to include those here. We don't have file-level NDV, though.
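   
   For orientation, a hedged sketch of how a procedure like this would be invoked from Spark SQL (the registered procedure name and catalog are illustrative; they are not shown in this diff):
   
   ```java
   // Hypothetical call: collect approximate NDVs for two columns and
   // attach the result to the table's statistics file.
   spark.sql(
       "CALL my_catalog.system.distinct_count("
           + "table => 'db.sample', "
           + "columns => array('id', 'data'))");
   ```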



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070181959


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the 
requested columns and
+ * sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = 
LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", 
DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, 
Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+  fieldId[i] = schema.fieldIndex(colName);
+}
+
+query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+Dataset<Row> df = spark().sql(query);
+
+for (int i = 0; i < columnSizes; i++) {
+  ndvs[i] = df.head().getLong(i);
+}
+
+TableOperations operations = 

[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

2023-01-13 Thread GitBox


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070182092


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the 
requested columns and
+ * sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = 
LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+  new ProcedureParameter[] {
+ProcedureParameter.required("table", DataTypes.StringType),
+ProcedureParameter.optional("distinct_count_view", 
DataTypes.StringType),
+ProcedureParameter.optional("columns", STRING_ARRAY),
+  };
+
+  private static final StructType OUTPUT_TYPE =
+  new StructType(
+  new StructField[] {
+new StructField("view_name", DataTypes.StringType, false, 
Metadata.empty())
+  });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+return new Builder() {
+  @Override
+  protected DistinctCountProcedure doBuild() {
+return new DistinctCountProcedure(tableCatalog());
+  }
+};
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+String tableName = args.getString(0);
+Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+SparkTable sparkTable = loadSparkTable(tableIdent);
+StructType schema = sparkTable.schema();
+Table table = sparkTable.table();
+ArrayData columns = args.getArray(2);
+int columnSizes = columns.numElements();
+
+long[] ndvs = new long[columnSizes];
+int[] fieldId = new int[columnSizes];
+String query = "SELECT ";
+for (int i = 0; i < columnSizes; i++) {
+  String colName = columns.getUTF8String(i).toString();
+  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+  fieldId[i] = schema.fieldIndex(colName);
+}
+
+query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+Dataset<Row> df = spark().sql(query);
+
+for (int i = 0; i < columnSizes; i++) {
+  ndvs[i] = df.head().getLong(i);
+}
+
+TableOperations operations = 

[GitHub] [iceberg] jackye1995 commented on pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


jackye1995 commented on PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#issuecomment-1382622527

   Since the immutable is addressed and we have enough approvals, I will go 
ahead to merge the PR, thanks for the review @rdblue and @nastra !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 merged pull request #6565: Core: View history entry core implementation

2023-01-13 Thread GitBox


jackye1995 merged PR #6565:
URL: https://github.com/apache/iceberg/pull/6565


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 commented on pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


jackye1995 commented on PR #6428:
URL: https://github.com/apache/iceberg/pull/6428#issuecomment-1382624709

   Looks like some CI tests are failing? Could you check? Maybe need to rebase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs

2023-01-13 Thread GitBox


jackye1995 commented on code in PR #6573:
URL: https://github.com/apache/iceberg/pull/6573#discussion_r1070189868


##
docs/spark-queries.md:
##
@@ -126,6 +126,8 @@ To select a specific table snapshot or the snapshot at some 
time in the DataFram
 
 * `snapshot-id` selects a specific table snapshot
 * `as-of-timestamp` selects the current snapshot at a timestamp, in 
milliseconds
+* `branch` selects the head snapshot of the specified branch. Note that 
currently branch cannot be combined with as-of-timestamp.
+* `tag` selects the snapshot associated with the specified tag

Review Comment:
   Given that it is a syntax change, I am waiting some more time for others to 
take a look. I think we can merge this one first and add that later.
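   
   For reference, a minimal sketch of the two read options described in the diff, via the Java DataFrameReader API (`spark` is an existing SparkSession; the table and reference names are illustrative):
   
   ```java
   // Read the head snapshot of a branch.
   Dataset<Row> branchDf =
       spark.read().format("iceberg").option("branch", "audit-branch").load("db.table");
   
   // Read the snapshot a tag points to.
   Dataset<Row> tagDf =
       spark.read().format("iceberg").option("tag", "historical-tag").load("db.table");
   ```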



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 commented on pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs

2023-01-13 Thread GitBox


jackye1995 commented on PR #6573:
URL: https://github.com/apache/iceberg/pull/6573#issuecomment-1382625570

   Thanks everyone for the review. As I said in the thread about the SQL-related 
changes, I will wait some more time in case there are disagreements. I will 
merge this in first and we can add follow-up PRs on this front.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 merged pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs

2023-01-13 Thread GitBox


jackye1995 merged PR #6573:
URL: https://github.com/apache/iceberg/pull/6573


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] renshangtao commented on pull request #6577: Fix the problem of shadow jar without modifying meta-inf/services at the same time

2023-01-13 Thread GitBox


renshangtao commented on PR #6577:
URL: https://github.com/apache/iceberg/pull/6577#issuecomment-1382632267

   @nastra thank you for your reply. After this modification, the code can be 
executed as expected. I opened the runtime jar package, and the service class 
corresponding to the shadow jar in META-INF/services has changed to 
org.apache.iceberg.shaded.org.apache.orc.impl.mask.MaskProvider, which should be 
correct.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 commented on pull request #6575: Spark 3.3: support version travel by reference name

2023-01-13 Thread GitBox


jackye1995 commented on PR #6575:
URL: https://github.com/apache/iceberg/pull/6575#issuecomment-1382632711

   @aokolnychyi @RussellSpitzer @rdblue any opinions about this support?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] jackye1995 opened a new pull request, #6586: AWS: make warehouse path optional for read only catalog use cases

2023-01-13 Thread GitBox


jackye1995 opened a new pull request, #6586:
URL: https://github.com/apache/iceberg/pull/6586

   Currently a warehouse path must be specified in every situation, but in many 
cases the user just wants to initialize the Glue catalog to read data and 
doesn't want to pass in a warehouse path. This supports that use case without 
the user having to specify a dummy warehouse path to bypass the checks.
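   
   For illustration, a hedged sketch of the read-only initialization this enables (the snippet itself is an assumption, not code from the PR):
   
   ```java
   import org.apache.iceberg.aws.glue.GlueCatalog;
   import org.apache.iceberg.catalog.TableIdentifier;
   import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
   
   // No warehouse property at all: previously this failed the warehouse
   // path check even though a read-only workload never uses it.
   GlueCatalog catalog = new GlueCatalog();
   catalog.initialize("glue", ImmutableMap.of());
   catalog.loadTable(TableIdentifier.of("db", "tbl")).currentSnapshot();
   ```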


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] dennishuo commented on pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


dennishuo commented on PR #6428:
URL: https://github.com/apache/iceberg/pull/6428#issuecomment-1382648326

   @jackye1995 Thanks for the heads up! Looks like merging to head fixed it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] danielcweeks merged pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2023-01-13 Thread GitBox


danielcweeks merged PR #6428:
URL: https://github.com/apache/iceberg/pull/6428


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] dmgcodevil opened a new issue, #6587: Wrong class, java.lang.Long, for object: 19367

2023-01-13 Thread GitBox


dmgcodevil opened a new issue, #6587:
URL: https://github.com/apache/iceberg/issues/6587

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I have a timestamp field of type `timestamptz`. I'm trying to compact files 
using a Spark action but am getting the following error:
   
   ```
   Wrong class, java.lang.Long, for object: 19367
   ```
   
   My schema:
   
   ```
   {
 "id" : 10,
 "name" : "tick_timestamp",
 "required" : false,
 "type" : "timestamptz"
   }, {
 "id" : 11,
 "name" : "transaction_timestamp",
 "required" : false,
 "type" : "timestamptz"
   }
   ```
   
   
   Partition specs:
   
   ```json
   "partition-spec" : [ {
   "name" : "transaction_timestamp_day",
   "transform" : "void",
   "source-id" : 11,
   "field-id" : 1000
 }, {
   "name" : "id059_bucket",
   "transform" : "bucket[20]",
   "source-id" : 5,
   "field-id" : 1001
 }, {
   "name" : "tick_timestamp_day",
   "transform" : "day",
   "source-id" : 10,
   "field-id" : 1002
 } ],
 "default-spec-id" : 2,
 "partition-specs" : [ {
   "spec-id" : 0,
   "fields" : [ {
 "name" : "transaction_timestamp_day",
 "transform" : "day",
 "source-id" : 11,
 "field-id" : 1000
   }, {
 "name" : "id059_bucket",
 "transform" : "bucket[20]",
 "source-id" : 5,
 "field-id" : 1001
   } ]
 }, {
   "spec-id" : 1,
   "fields" : [ {
 "name" : "tick_timestamp_day",
 "transform" : "day",
 "source-id" : 11,
 "field-id" : 1000
   }, {
 "name" : "id059_bucket",
 "transform" : "bucket[20]",
 "source-id" : 5,
 "field-id" : 1001
   } ]
 }, {
   "spec-id" : 2,
   "fields" : [ {
 "name" : "transaction_timestamp_day",
 "transform" : "void",
 "source-id" : 11,
 "field-id" : 1000
   }, {
 "name" : "id059_bucket",
 "transform" : "bucket[20]",
 "source-id" : 5,
 "field-id" : 1001
   }, {
 "name" : "tick_timestamp_day",
 "transform" : "day",
 "source-id" : 10,
 "field-id" : 1002
   } ]
 } ]
   ```
   
   Test:
   
   ```java
   val task = table.newScan().filter().planFiles().asScala.head
   Comparators.forType(table.spec.partitionType()).compare(task.file().partition(), task.file().partition())
   ```
   
   Error:
   
   ```
   Wrong class, java.lang.Long, for object: 19367
   ```
   
   The one thing I noticed is that `` returns the following type: 
   
   ```
   
{"type":"record","name":"PartitionData","namespace":"org.apache.iceberg","fields":[{"name":"tick_timestamp_day","type":["null",{"type":"int","logicalType":"date"}],"default":null,"field-id":1000},{"name":"id059_bucket","type":["null","int"],"default":null,"field-id":1001}]}
   ```
   
   `tick_timestamp_day` has type `date` (int) instead of `timestamptz`. Any 
thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] singhpk234 commented on pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure

2023-01-13 Thread GitBox


singhpk234 commented on PR #6576:
URL: https://github.com/apache/iceberg/pull/6576#issuecomment-1382673430

   Did some more digging; posting what I found so far. This issue is only 
observed in 0.14.x and 1.0.0 (and I had directly tested my fix on top of master 
:sweat_smile:); 1.1.0 and master are fine. This is because starting with 1.1.0 
we create a Glue table before writing the S3 metadata files (a dummy table, to 
be removed later if the commit does not succeed), so the 
isTableRegisteredWithLF check now passes since the table is already created. 
Prior to this, table creation used to happen after the S3 write, which would 
result in failure.
   
   Given the above behaviour, this PR should no longer be required, at least on 
master (happy to create backport PRs for lower versions if required), hence 
closing this PR.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] singhpk234 closed pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure

2023-01-13 Thread GitBox


singhpk234 closed pull request #6576: AWS: Fix check for 
isTableRegisteredWithLF leading to CREATE table failure
URL: https://github.com/apache/iceberg/pull/6576


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] singhpk234 commented on issue #6523: Table creation fails with Glue catalog on EMR

2023-01-13 Thread GitBox


singhpk234 commented on issue #6523:
URL: https://github.com/apache/iceberg/issues/6523#issuecomment-1382673665

   Please consider using the Iceberg 1.1.0 release instead; it has a fix for 
the failure, added as part of:
   - https://github.com/apache/iceberg/pull/4423/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

2023-01-13 Thread GitBox


ajantha-bhat commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1382677828

   > I'm still a little worried about saving this information, what does 
knowing the sort order mean for a file. Are we guaranteeing that the file is 
locally sorted by that order? or globally sorted with files added in that 
snapshot. I'm cautious about doing anything here until we really define what it 
means.
   
   **Spec says this. I am just trying to make the implementation the same as 
the spec. I am ok with changing the spec or implementation.** 
   
   > `140 sort_order_id` - ID representing sort order for this file [3].
   If the sort order ID is missing or unknown, then the order is assumed to be 
unsorted. Only data files and equality delete files should be written with a 
non-null order id. [Position 
deletes](https://iceberg.apache.org/spec/#position-delete-files) are required 
to be sorted by file and position, not a table order, and should set sort order 
id to null. Readers must ignore sort order id for position delete files.
   
   cc: @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] nastra commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name

2023-01-14 Thread GitBox


nastra commented on code in PR #6575:
URL: https://github.com/apache/iceberg/pull/6575#discussion_r1070237113


##
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##
@@ -231,6 +234,67 @@ public void testVersionAsOf() {
 assertEquals("Snapshot at specific ID " + snapshotId, expected, fromDF);
   }
 
+  @Test
+  public void testTagReferenceAsOf() {
+Table table = validationCatalog.loadTable(tableIdent);
+long snapshotId = table.currentSnapshot().snapshotId();
+table.manageSnapshots().createTag("test_tag", snapshotId).commit();
+
+// create a second snapshot, read the table at the snapshot
+List<Object[]> expected = sql("SELECT * FROM %s", tableName);
+sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName);
+List<Object[]> actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_tag'", tableName);
+assertEquals("Snapshot at specific tag reference name", expected, actual1);
+
+// read the table at the snapshot
+// HIVE time travel syntax
+List<Object[]> actual2 = sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_tag'", tableName);
+assertEquals("Snapshot at specific tag reference name", expected, actual2);
+
+// read the table using DataFrameReader option: tag
+Dataset<Row> df =
+spark.read().format("iceberg").option(SparkReadOptions.TAG, "test_tag").load(tableName);
+List<Object[]> fromDF = rowsToJava(df.collectAsList());
+assertEquals("Snapshot at specific tag reference name", expected, fromDF);
+  }
+
+  @Test
+  public void testBranchReferenceAsOf() {
+Table table = validationCatalog.loadTable(tableIdent);
+long snapshotId = table.currentSnapshot().snapshotId();
+table.manageSnapshots().createBranch("test_branch", snapshotId).commit();
+
+// create a second snapshot, read the table at the snapshot
+List<Object[]> expected = sql("SELECT * FROM %s", tableName);
+sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName);
+List<Object[]> actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_branch'", tableName);
+assertEquals("Snapshot at specific branch reference name", expected, actual1);
+
+// read the table at the snapshot
+// HIVE time travel syntax
+List<Object[]> actual2 = sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_branch'", tableName);
+assertEquals("Snapshot at specific branch reference name", expected, actual2);
+
+// read the table using DataFrameReader option: branch
+Dataset<Row> df =
+spark
+.read()
+.format("iceberg")
+.option(SparkReadOptions.BRANCH, "test_branch")
+.load(tableName);
+List<Object[]> fromDF = rowsToJava(df.collectAsList());
+assertEquals("Snapshot at specific branch reference name", expected, fromDF);
+  }
+
+  @Test
+  public void testUnknownReferenceAsOf() {
+Assertions.assertThatThrownBy(
+() -> sql("SELECT * FROM %s VERSION AS OF 'test_unknown'", 
tableName))
+.as("Cannot find matching snapshot ID or reference name for version")

Review Comment:
   I think you'd rather want `hasMessage(..)` or `hasMessageContaining(...)` 
here instead of `as(..)`.
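   
   Something like the following, as a sketch (the exact exception type asserted is an assumption):
   
   ```java
   Assertions.assertThatThrownBy(
           () -> sql("SELECT * FROM %s VERSION AS OF 'test_unknown'", tableName))
       .isInstanceOf(ValidationException.class) // assumed; use the type actually thrown
       .hasMessageContaining("Cannot find matching snapshot ID or reference name for version");
   ```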



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] nastra commented on pull request #6577: Fix the problem of shadow jar without modifying meta-inf/services at the same time

2023-01-14 Thread GitBox


nastra commented on PR #6577:
URL: https://github.com/apache/iceberg/pull/6577#issuecomment-1382694306

   > @nastra thank you for your reply. After this modification, the code can be 
executed as expected. I opened the runtime jar package, and the service class 
corresponding to the shadow jar in META-INF/services has changed to 
org.apache.iceberg.shaded.org.apache.orc.impl.mask.MaskProvider, which should be 
correct.
   
   oh you're right, it shows up as the relocated name, all good then and still 
+1 on merging this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] Fokko commented on pull request #6525: Python: Refactor loading manifests

2023-01-14 Thread GitBox


Fokko commented on PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#issuecomment-1382722862

   Thanks for the thorough review and PR @rdblue!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] Fokko merged pull request #6525: Python: Refactor loading manifests

2023-01-14 Thread GitBox


Fokko merged PR #6525:
URL: https://github.com/apache/iceberg/pull/6525


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] Fokko merged pull request #6585: Update ORC to 1.8.2

2023-01-14 Thread GitBox


Fokko merged PR #6585:
URL: https://github.com/apache/iceberg/pull/6585


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] Fokko commented on a diff in pull request #6574: Python: Raise exception on deletes

2023-01-14 Thread GitBox


Fokko commented on code in PR #6574:
URL: https://github.com/apache/iceberg/pull/6574#discussion_r1070259471


##
python/pyiceberg/table/__init__.py:
##
@@ -341,7 +346,18 @@ def plan_files(self) -> Iterator[FileScanTask]:
 all_files = files(io.new_input(manifest.manifest_path))
 matching_partition_files = filter(partition_filter, all_files)
 
-yield from (FileScanTask(file) for file in 
matching_partition_files)
+def _check_content(file: ManifestFile) -> ManifestFile:
+try:
+if file.content == ManifestContent.DELETES:
+raise ValueError("PyIceberg does not support deletes: 
https://github.com/apache/iceberg/issues/6568")

Review Comment:
   Good call, I ran some table scans locally, but great idea to have this 
unit-tested as well 👍🏻 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] RussellSpitzer opened a new pull request, #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

2023-01-14 Thread GitBox


RussellSpitzer opened a new pull request, #6588:
URL: https://github.com/apache/iceberg/pull/6588

   An issue we've run into frequently is that several Spark actions perform deletes on the driver with a default parallelism of 1. This is quite slow on S3 and painfully slow for very large tables. To fix this, we change the default behavior so that these deletes are always multithreaded, as sketched below.
   
   The default for all Spark-related actions can be overridden with a SQL conf parameter, and each command also accepts its own parallelism parameter.
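   Roughly, the behavior change looks like the sketch below (illustrative only, written in Python rather than the Java of the actual PR; the thread count of 25 matches the proposed default in the diff further down, everything else is assumed):
   
   ```
   # Sketch: moving driver-side file deletes from a serial loop onto a
   # bounded thread pool. Not the PR's actual code.
   from concurrent.futures import ThreadPoolExecutor
   
   DEFAULT_DELETE_PARALLELISM = 25  # proposed default in the PR
   
   def delete_all(paths, delete_func):
       # Before: `for path in paths: delete_func(path)` -- parallelism of 1.
       with ThreadPoolExecutor(max_workers=DEFAULT_DELETE_PARALLELISM) as pool:
           futures = [pool.submit(delete_func, p) for p in paths]
           for future in futures:
               try:
                   future.result()  # surface failures without stopping other deletes
               except Exception as exc:  # mirrors "log and suppress" semantics
                   print(f"Delete failed: {exc}")
   
   # Usage (hypothetical): delete_all(["s3://bucket/a", "s3://bucket/b"], io.delete)
   ```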





[GitHub] [iceberg] RussellSpitzer commented on pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

2023-01-14 Thread GitBox


RussellSpitzer commented on PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#issuecomment-1382730394

   @anuragmantri + @aokolnychyi + @rdblue - This is a bit of a big default 
behavior change but it's been biting a lot of our users lately and the change 
is relatively safe.





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

2023-01-14 Thread GitBox


RussellSpitzer commented on code in PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261827


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##
@@ -47,4 +47,8 @@ private SparkSQLProperties() {}
   public static final String PRESERVE_DATA_GROUPING =
       "spark.sql.iceberg.planning.preserve-data-grouping";
   public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false;
+
+  // Controls how many physical file deletes to execute in parallel when not otherwise specified
+  public static final String DELETE_PARALLELISM = "driver-delete-default-parallelism";
+  public static final String DELETE_PARALLELISM_DEFAULT = "25";

Review Comment:
   With S3's request throttling at around 4k requests per second, this leaves plenty of headroom. Assuming a 50ms response time, each thread can issue about 20 requests per second, so 4000 max requests/second ÷ 20 requests/thread/second ≈ 200 max concurrent requests (see the quick check below).
   
   Another option is to also incorporate the "bulk delete" APIs, but that would only help with S3-based filesystems.
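   A back-of-envelope check of that ceiling (numbers assumed from the comment above):
   
   ```
   # Sanity check of the ~200 concurrent-request ceiling quoted above.
   s3_requests_per_sec = 4000        # approximate S3 throttle cited above
   response_time_sec = 0.050         # assumed 50 ms per delete request
   requests_per_thread_per_sec = 1 / response_time_sec   # = 20 req/s per thread
   max_concurrent = s3_requests_per_sec / requests_per_thread_per_sec
   print(max_concurrent)             # 200.0 -> a default of 25 leaves ample headroom
   ```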
   






[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

2023-01-14 Thread GitBox


RussellSpitzer commented on code in PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261893


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##
@@ -231,24 +258,27 @@ protected DeleteSummary deleteFiles(
 
     DeleteSummary summary = new DeleteSummary();
 
-    Tasks.foreach(files)
-        .retry(DELETE_NUM_RETRIES)
-        .stopRetryOn(NotFoundException.class)
-        .suppressFailureWhenFinished()
-        .executeWith(executorService)
-        .onFailure(
-            (fileInfo, exc) -> {
-              String path = fileInfo.getPath();
-              String type = fileInfo.getType();
-              LOG.warn("Delete failed for {}: {}", type, path, exc);
-            })
-        .run(
-            fileInfo -> {
-              String path = fileInfo.getPath();
-              String type = fileInfo.getType();
-              deleteFunc.accept(path);
-              summary.deletedFile(path, type);
-            });
+
+    withDefaultDeleteService(executorService, (deleteService) -> {

Review Comment:
   This covers ExpireSnapshots and DropTable with Purge
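   For readers following along, a sketch of what a `withDefaultDeleteService`-style wrapper plausibly does (an assumption about the shape of the helper, not the PR's actual code): fall back to a default bounded pool when the caller did not supply one, and tear down only the pool it created.
   
   ```
   # Hypothetical sketch of the wrapper's shape; names are assumptions.
   from concurrent.futures import ThreadPoolExecutor
   
   def with_default_delete_service(executor, default_parallelism, body):
       owned = executor is None                 # did we create the pool ourselves?
       pool = executor or ThreadPoolExecutor(max_workers=default_parallelism)
       try:
           body(pool)                           # run the deletes against the pool
       finally:
           if owned:
               pool.shutdown(wait=True)         # only shut down a pool we own
   ```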






[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

2023-01-14 Thread GitBox


RussellSpitzer commented on code in PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261893


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##
@@ -231,24 +258,27 @@ protected DeleteSummary deleteFiles(
 
     DeleteSummary summary = new DeleteSummary();
 
-    Tasks.foreach(files)
-        .retry(DELETE_NUM_RETRIES)
-        .stopRetryOn(NotFoundException.class)
-        .suppressFailureWhenFinished()
-        .executeWith(executorService)
-        .onFailure(
-            (fileInfo, exc) -> {
-              String path = fileInfo.getPath();
-              String type = fileInfo.getType();
-              LOG.warn("Delete failed for {}: {}", type, path, exc);
-            })
-        .run(
-            fileInfo -> {
-              String path = fileInfo.getPath();
-              String type = fileInfo.getType();
-              deleteFunc.accept(path);
-              summary.deletedFile(path, type);
-            });
+
+    withDefaultDeleteService(executorService, (deleteService) -> {

Review Comment:
   This covers ExpireSnapshots and DropTable with Purge (via 
DeleteReachableFilesAction)
   






[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

2023-01-14 Thread GitBox


RussellSpitzer commented on code in PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261927


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##
@@ -246,12 +246,13 @@ private DeleteOrphanFiles.Result doExecute() {
     List<String> orphanFiles =
         findOrphanFiles(spark(), actualFileIdentDS, validFileIdentDS, prefixMismatchMode);
 
-    Tasks.foreach(orphanFiles)
-        .noRetry()
-        .executeWith(deleteExecutorService)
-        .suppressFailureWhenFinished()
-        .onFailure((file, exc) -> LOG.warn("Failed to delete file: {}", file, exc))
-        .run(deleteFunc::accept);
+    withDefaultDeleteService(deleteExecutorService, deleteService ->

Review Comment:
   This change covers RemoveOrphanFiles






[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name

2023-01-14 Thread GitBox


RussellSpitzer commented on code in PR #6575:
URL: https://github.com/apache/iceberg/pull/6575#discussion_r1070263376


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##
@@ -159,7 +160,15 @@ public Table loadTable(Identifier ident, String version) throws NoSuchTableExcep
           sparkTable.snapshotId() == null,
           "Cannot do time-travel based on both table identifier and AS OF");
 
-      return sparkTable.copyWithSnapshotId(Long.parseLong(version));
+      try {

Review Comment:
   I'm not sure this can come up, but do we allow version tags to be snapshot IDs?
   
   For example, can I tag snapshot 2 to be known as "1"?
   
   It's a weird edge case, so I don't think we really need to handle it; I'm just wondering whether it could trip up the lookup code here.
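   To make the edge case concrete, the resolution order decides which interpretation wins. A hypothetical sketch (`refs` and the helper are assumptions, not Iceberg's API):
   
   ```
   # Sketch of the resolution-order question: a ref (tag/branch) whose name
   # parses as a number could shadow, or be shadowed by, a literal snapshot id.
   # `refs` is a hypothetical mapping of ref name -> snapshot id.
   def resolve_version(version: str, refs: dict) -> int:
       if version in refs:
           return refs[version]      # ref name wins over the numeric parse
       try:
           return int(version)       # fall back to a literal snapshot id
       except ValueError:
           raise ValueError(f"Cannot find matching snapshot or reference: {version}")
   
   # A tag literally named "1" pointing at snapshot 2 resolves to 2 here,
   # while plain "2" (no such ref) still resolves to snapshot 2.
   print(resolve_version("1", {"1": 2}))  # 2
   print(resolve_version("2", {"1": 2}))  # 2
   ```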






[GitHub] [iceberg] RussellSpitzer commented on issue #6587: Wrong class, java.lang.Long, for object: 19367

2023-01-14 Thread GitBox


RussellSpitzer commented on issue #6587:
URL: https://github.com/apache/iceberg/issues/6587#issuecomment-1382733719

   Could you post the full stack trace from the Spark run?




