[GitHub] [iceberg] Fokko opened a new pull request, #6580: Python: Bump pylint
Fokko opened a new pull request, #6580: URL: https://github.com/apache/iceberg/pull/6580

And remove the temporary hack.
[GitHub] [iceberg] Fokko commented on a diff in pull request #6525: Python: Refactor loading manifests
Fokko commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1069592946

## python/pyiceberg/avro/reader.py:

```diff
@@ -238,41 +249,50 @@ def skip(self, decoder: BinaryDecoder) -> None:
         return self.option.skip(decoder)
 
 
-class StructProtocolReader(Reader):
-    create_struct: Callable[[], StructProtocol]
-    fields: Tuple[Tuple[Optional[int], Reader], ...]
+class StructReader(Reader):
+    field_readers: Tuple[Tuple[Optional[int], Reader], ...]
+    create_struct: Type[StructProtocol]
+    struct: Optional[StructType]
 
-    def __init__(self, fields: Tuple[Tuple[Optional[int], Reader], ...], create_struct: Callable[[], StructProtocol]):
-        self.create_struct = create_struct
-        self.fields = fields
-
-    def create_or_reuse(self, reuse: Optional[StructProtocol]) -> StructProtocol:
-        if reuse:
-            return reuse
-        else:
-            return self.create_struct()
+    def __init__(
+        self,
+        field_readers: Tuple[Tuple[Optional[int], Reader], ...],
+        create_struct: Optional[Type[StructProtocol]] = None,
```

Review Comment: Inspecting gives different results across different Python versions. It works fine locally on 3.10, but CI fails on 3.8. It looks like it doesn't look up the constructor of the class that it inherits from:

```
FullArgSpec(args=['cls'], varargs='args', varkw='kwds', defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={})
```
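For context, the behavior being described can be reproduced with plain `inspect` (a minimal sketch with hypothetical class names, not pyiceberg code; the exact output varies by Python version, which is the point of the comment above):

```python
import inspect


class Base:
    def __init__(self, field_readers, create_struct=None):
        self.field_readers = field_readers
        self.create_struct = create_struct


class Child(Base):  # defines no __init__ of its own
    pass


# Inspecting the class itself is where versions have differed: instead of
# resolving the inherited Base.__init__, an older interpreter can report the
# generic spec seen above -- FullArgSpec(args=['cls'], varargs='args', ...).
print(inspect.getfullargspec(Child))
# Inspecting the inherited constructor directly is unambiguous:
print(inspect.getfullargspec(Child.__init__))
```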
[GitHub] [iceberg] Fokko commented on a diff in pull request #6525: Python: Refactor loading manifests
Fokko commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1069602456

## python/pyiceberg/avro/reader.py:

```diff
@@ -238,41 +249,50 @@ def skip(self, decoder: BinaryDecoder) -> None:
         return self.option.skip(decoder)
 
 
-class StructProtocolReader(Reader):
-    create_struct: Callable[[], StructProtocol]
-    fields: Tuple[Tuple[Optional[int], Reader], ...]
+class StructReader(Reader):
+    field_readers: Tuple[Tuple[Optional[int], Reader], ...]
+    create_struct: Type[StructProtocol]
+    struct: Optional[StructType]
 
-    def __init__(self, fields: Tuple[Tuple[Optional[int], Reader], ...], create_struct: Callable[[], StructProtocol]):
-        self.create_struct = create_struct
-        self.fields = fields
-
-    def create_or_reuse(self, reuse: Optional[StructProtocol]) -> StructProtocol:
-        if reuse:
-            return reuse
-        else:
-            return self.create_struct()
+    def __init__(
+        self,
+        field_readers: Tuple[Tuple[Optional[int], Reader], ...],
+        create_struct: Optional[Type[StructProtocol]] = None,
```

Review Comment: Moving to an _it's easier to ask for forgiveness than permission_ approach
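That EAFP pattern, sketched minimally (hypothetical names, not the PR's actual code): attempt the call and handle the failure, instead of inspecting the constructor signature up front:

```python
from typing import Any, Type


def instantiate(create_struct: Type[Any], num_fields: int) -> Any:
    # EAFP: try the richer constructor first and fall back on TypeError,
    # rather than inspecting signatures beforehand (LBYL), which the thread
    # above shows is unreliable across Python versions.
    try:
        return create_struct(*([None] * num_fields))
    except TypeError:
        return create_struct()
```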
[GitHub] [iceberg-docs] InvisibleProgrammer commented on pull request #191: Fix sidebar
InvisibleProgrammer commented on PR #191: URL: https://github.com/apache/iceberg-docs/pull/191#issuecomment-1382031000

@samredai : thx for the approve.
[GitHub] [iceberg-docs] RussellSpitzer merged pull request #191: Fix sidebar
RussellSpitzer merged PR #191: URL: https://github.com/apache/iceberg-docs/pull/191
[GitHub] [iceberg-docs] RussellSpitzer commented on pull request #191: Fix sidebar
RussellSpitzer commented on PR #191: URL: https://github.com/apache/iceberg-docs/pull/191#issuecomment-1382043820

Thanks for the review @samredai and thank you @InvisibleProgrammer for the pr
[GitHub] [iceberg] szehon-ho opened a new pull request, #6581: Spark 3.3: Add RemoveDanglingDeletes action
szehon-ho opened a new pull request, #6581: URL: https://github.com/apache/iceberg/pull/6581

This adds an action to clean up dangling (invalid) DeleteFiles that may otherwise keep getting carried over with the table's current snapshot. The problem statement and design doc are here: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit# In a nutshell, the current table-wide mechanism is crude and may miss many instances of aging off invalid DeleteFiles, even after compaction, negatively impacting performance.
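For intuition, a rough sketch of the "dangling" test based on Iceberg's general sequence-number semantics (illustrative only, not the PR's code):

```python
def is_dangling(delete_seq: int, min_data_seq_in_partition: int, equality: bool) -> bool:
    """Whether a delete file can no longer apply to any live data file in its partition.

    Position deletes apply to data files with data_seq <= delete_seq, so such a
    delete is dangling once min_data_seq > delete_seq; equality deletes apply
    strictly to older data files (data_seq < delete_seq), so >= suffices.
    """
    if equality:
        return min_data_seq_in_partition >= delete_seq
    return min_data_seq_in_partition > delete_seq
```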
[GitHub] [iceberg] krvikash commented on a diff in pull request #6499: AWS, Core, Hive: Fix `checkCommitStatus` when create table commit fails
krvikash commented on code in PR #6499: URL: https://github.com/apache/iceberg/pull/6499#discussion_r1069695013

## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCommits.java:

```diff
@@ -307,6 +312,61 @@ public void testAlreadyExistsException() {
         () -> catalog.createTable(TABLE_IDENTIFIER, schema, PartitionSpec.unpartitioned()));
   }
 
+  @Test
+  public void testCreateTableCommitFailure() throws TException, InterruptedException {
```

Review Comment: Thanks, @nastra, for the review. Yes, you are right: `testCreateTableCommitFailure` will not reach `checkCommitStatus`. I have not seen any test case that covers a create-table failure, which is why I added this test case to verify the create-table failure path where we are not required to check the commit status. I have now separated this change into a new commit.
[GitHub] [iceberg] stevenzwu merged pull request #6572: Flink: backport PR #6337, PR #6426, PR #6557 to Flink 1.14 and 1.15 f…
stevenzwu merged PR #6572: URL: https://github.com/apache/iceberg/pull/6572
[GitHub] [iceberg] stevenzwu commented on pull request #6572: Flink: backport PR #6337, PR #6426, PR #6557 to Flink 1.14 and 1.15 f…
stevenzwu commented on PR #6572: URL: https://github.com/apache/iceberg/pull/6572#issuecomment-1382135319

thanks @pvary for review
[GitHub] [iceberg] stevenzwu merged pull request #6401: Flink: Change to oldestAncestorAfter
stevenzwu merged PR #6401: URL: https://github.com/apache/iceberg/pull/6401
[GitHub] [iceberg] stevenzwu commented on pull request #6401: Flink: Change to oldestAncestorAfter
stevenzwu commented on PR #6401: URL: https://github.com/apache/iceberg/pull/6401#issuecomment-1382149280

thanks @hililiwei for the contribution
[GitHub] [iceberg] stevenzwu merged pull request #6222: Flink: Support inspecting table
stevenzwu merged PR #6222: URL: https://github.com/apache/iceberg/pull/6222
[GitHub] [iceberg] stevenzwu commented on pull request #6222: Flink: Support inspecting table
stevenzwu commented on PR #6222: URL: https://github.com/apache/iceberg/pull/6222#issuecomment-1382153049

thanks @hililiwei for contributing this major feature
[GitHub] [iceberg] rdblue commented on a diff in pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs
rdblue commented on code in PR #6517: URL: https://github.com/apache/iceberg/pull/6517#discussion_r1069784838

## parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:

```diff
@@ -560,24 +560,26 @@ private T max(Statistics statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row group but no further
-   * statistics are available.
-   *
-   * We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are present.
    *
    * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different
    * unusual hasNonNull behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits
    * the reading of these statistics from versions of Parquet earlier than 1.8.0.
    *
    * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
    */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
-    return statistics.getNumNulls() < valueCount
-        && (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
+  static boolean nullMinMax(Statistics statistics) {
+    return statistics.getMaxBytes() == null || statistics.getMinBytes() == null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+    return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);
```

Review Comment: Sounds good. It's odd that this is not based on the number of values and number of nulls. Okay though, I guess it's just a strange API.
[GitHub] [iceberg] rdblue merged pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs
rdblue merged PR #6517: URL: https://github.com/apache/iceberg/pull/6517
[GitHub] [iceberg] rdblue closed issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows
rdblue closed issue #6516: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows URL: https://github.com/apache/iceberg/issues/6516
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6517: Parquet: Fixes Incorrect Skipping of RowGroups with NaNs
RussellSpitzer commented on code in PR #6517: URL: https://github.com/apache/iceberg/pull/6517#discussion_r1069785830

## parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:

```diff
@@ -560,24 +560,26 @@ private T max(Statistics statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row group but no further
-   * statistics are available.
-   *
-   * We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are present.
    *
    * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different
    * unusual hasNonNull behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits
    * the reading of these statistics from versions of Parquet earlier than 1.8.0.
    *
    * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
    */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
-    return statistics.getNumNulls() < valueCount
-        && (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
+  static boolean nullMinMax(Statistics statistics) {
+    return statistics.getMaxBytes() == null || statistics.getMinBytes() == null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+    return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);
```

Review Comment: It is quite strange
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name
jackye1995 commented on code in PR #6575: URL: https://github.com/apache/iceberg/pull/6575#discussion_r1069789840

## spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:

```diff
@@ -231,6 +233,68 @@ public void testVersionAsOf() {
     assertEquals("Snapshot at specific ID " + snapshotId, expected, fromDF);
   }
 
+  @Test
+  public void testTagReferenceAsOf() {
+    Table table = validationCatalog.loadTable(tableIdent);
+    long snapshotId = table.currentSnapshot().snapshotId();
+    table.manageSnapshots().createTag("test_tag", snapshotId).commit();
+
+    // create a second snapshot, read the table at the snapshot
+    List<Object[]> expected = sql("SELECT * FROM %s", tableName);
+    sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName);
+    List<Object[]> actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_tag'", tableName);
+    assertEquals("Snapshot at specific tag reference name", expected, actual1);
+
+    // read the table at the snapshot
+    // HIVE time travel syntax
+    List<Object[]> actual2 = sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_tag'", tableName);
+    assertEquals("Snapshot at specific tag reference name", expected, actual2);
+
+    // read the table using DataFrameReader option: branch
+    Dataset<Row> df =
+        spark.read().format("iceberg").option(SparkReadOptions.TAG, "test_tag").load(tableName);
+    List<Object[]> fromDF = rowsToJava(df.collectAsList());
+    assertEquals("Snapshot at specific tag reference name", expected, fromDF);
+  }
+
+  @Test
+  public void testBranchReferenceAsOf() {
+    Table table = validationCatalog.loadTable(tableIdent);
+    long snapshotId = table.currentSnapshot().snapshotId();
+    table.manageSnapshots().createBranch("test_branch", snapshotId).commit();
+
+    // create a second snapshot, read the table at the snapshot
+    List<Object[]> expected = sql("SELECT * FROM %s", tableName);
+    sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName);
+    List<Object[]> actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_branch'", tableName);
+    assertEquals("Snapshot at specific branch reference name", expected, actual1);
+
+    // read the table at the snapshot
+    // HIVE time travel syntax
+    List<Object[]> actual2 =
+        sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_branch'", tableName);
+    assertEquals("Snapshot at specific branch reference name", expected, actual2);
+
+    // read the table using DataFrameReader option: branch
+    Dataset<Row> df =
+        spark
+            .read()
+            .format("iceberg")
+            .option(SparkReadOptions.BRANCH, "test_branch")
+            .load(tableName);
+    List<Object[]> fromDF = rowsToJava(df.collectAsList());
+    assertEquals("Snapshot at specific branch reference name", expected, fromDF);
+  }
+
+  @Test
+  public void testUnknownReferenceAsOf() {
+    AssertHelpers.assertThrows(
+        "Cannot find matching snapshot ID or reference name for version",
```

Review Comment: sure, I can update to use that
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name
jackye1995 commented on code in PR #6575: URL: https://github.com/apache/iceberg/pull/6575#discussion_r1069798452

## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:

```diff
@@ -159,7 +160,15 @@ public Table loadTable(Identifier ident, String version) throws NoSuchTableExcep
         sparkTable.snapshotId() == null,
         "Cannot do time-travel based on both table identifier and AS OF");
 
-    return sparkTable.copyWithSnapshotId(Long.parseLong(version));
+    try {
+      return sparkTable.copyWithSnapshotId(Long.parseLong(version));
+    } catch (NumberFormatException e) {
+      SnapshotRef ref = sparkTable.table().refs().get(version);
+      ValidationException.check(
+          ref != null,
+          "Cannot find matching snapshot ID or reference name for version " + version);
+      return sparkTable.copyWithSnapshotId(ref.snapshotId());
```

Review Comment: I think one thing that might be useful is to time travel in a branch, something like `FOR SYSTEM_VERSION AS OF branchName@123456789012`. But that feels very hacky; I'd rather have some syntax as we have been suggesting, like `SELECT * FROM table BRANCH branch FOR SYSTEM_TIME AS OF xxx`. So I am leaving that part out of the implementation for now. At least I think most people can agree that a tag/branch head can be viewed as a version to travel to.
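Pulling the thread together, the reference-based reads exercised by the tests above look like this from PySpark (a sketch assuming a table `db.tbl` with an existing tag `test_tag` and branch `test_branch`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SQL time travel by reference name (tag or branch head)
spark.sql("SELECT * FROM db.tbl VERSION AS OF 'test_tag'").show()
spark.sql("SELECT * FROM db.tbl FOR SYSTEM_VERSION AS OF 'test_branch'").show()

# DataFrameReader options mirroring SparkReadOptions.TAG / SparkReadOptions.BRANCH
spark.read.format("iceberg").option("tag", "test_tag").load("db.tbl").show()
spark.read.format("iceberg").option("branch", "test_branch").load("db.tbl").show()
```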
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1069808528

## python/tests/expressions/test_evaluator.py:

```diff
@@ -52,112 +54,126 @@
 )
 
 
+def _record_simple(id: int, data: Optional[str]) -> Record:  # pylint: disable=redefined-builtin
+    r = Record(struct=SIMPLE_SCHEMA.as_struct())
+    r[0] = id
+    r[1] = data
+    return r
+
+
+def _record_float(id: float, f: float) -> Record:  # pylint: disable=redefined-builtin
+    r = Record(struct=FLOAT_SCHEMA.as_struct())
+    r[0] = id
+    r[1] = f
+    return r
+
+
 def test_true() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, AlwaysTrue(), case_sensitive=True)
-    assert evaluate(Record(1, "a"))
+    assert evaluate(_record_simple(1, "a"))
 
 
 def test_false() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, AlwaysFalse(), case_sensitive=True)
-    assert not evaluate(Record(1, "a"))
+    assert not evaluate(_record_simple(1, "a"))
 
 
 def test_less_than() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, LessThan("id", 3), case_sensitive=True)
-    assert evaluate(Record(2, "a"))
-    assert not evaluate(Record(3, "a"))
+    assert evaluate(_record_simple(2, "a"))
+    assert not evaluate(_record_simple(3, "a"))
 
 
 def test_less_than_or_equal() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, LessThanOrEqual("id", 3), case_sensitive=True)
-    assert evaluate(Record(1, "a"))
-    assert evaluate(Record(3, "a"))
-    assert not evaluate(Record(4, "a"))
+    assert evaluate(_record_simple(1, "a"))
+    assert evaluate(_record_simple(3, "a"))
+    assert not evaluate(_record_simple(4, "a"))
 
 
 def test_greater_than() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, GreaterThan("id", 3), case_sensitive=True)
-    assert not evaluate(Record(1, "a"))
-    assert not evaluate(Record(3, "a"))
-    assert evaluate(Record(4, "a"))
+    assert not evaluate(_record_simple(1, "a"))
+    assert not evaluate(_record_simple(3, "a"))
+    assert evaluate(_record_simple(4, "a"))
 
 
 def test_greater_than_or_equal() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, GreaterThanOrEqual("id", 3), case_sensitive=True)
-    assert not evaluate(Record(2, "a"))
-    assert evaluate(Record(3, "a"))
-    assert evaluate(Record(4, "a"))
+    assert not evaluate(_record_simple(2, "a"))
+    assert evaluate(_record_simple(3, "a"))
+    assert evaluate(_record_simple(4, "a"))
 
 
 def test_equal_to() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, EqualTo("id", 3), case_sensitive=True)
-    assert not evaluate(Record(2, "a"))
-    assert evaluate(Record(3, "a"))
-    assert not evaluate(Record(4, "a"))
+    assert not evaluate(_record_simple(2, "a"))
+    assert evaluate(_record_simple(3, "a"))
+    assert not evaluate(_record_simple(4, "a"))
 
 
 def test_not_equal_to() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, NotEqualTo("id", 3), case_sensitive=True)
-    assert evaluate(Record(2, "a"))
-    assert not evaluate(Record(3, "a"))
-    assert evaluate(Record(4, "a"))
+    assert evaluate(_record_simple(2, "a"))
+    assert not evaluate(_record_simple(3, "a"))
+    assert evaluate(_record_simple(4, "a"))
 
 
 def test_in() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, In("id", [1, 2, 3]), case_sensitive=True)
-    assert evaluate(Record(2, "a"))
-    assert evaluate(Record(3, "a"))
-    assert not evaluate(Record(4, "a"))
+    assert evaluate(_record_simple(2, "a"))
+    assert evaluate(_record_simple(3, "a"))
+    assert not evaluate(_record_simple(4, "a"))
 
 
 def test_not_in() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, NotIn("id", [1, 2, 3]), case_sensitive=True)
-    assert not evaluate(Record(2, "a"))
-    assert not evaluate(Record(3, "a"))
-    assert evaluate(Record(4, "a"))
+    assert not evaluate(_record_simple(2, "a"))
+    assert not evaluate(_record_simple(3, "a"))
+    assert evaluate(_record_simple(4, "a"))
 
 
 def test_is_null() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, IsNull("data"), case_sensitive=True)
-    assert not evaluate(Record(2, "a"))
-    assert evaluate(Record(3, None))
+    assert not evaluate(_record_simple(2, "a"))
+    assert evaluate(_record_simple(3, None))
 
 
 def test_not_null() -> None:
     evaluate = expression_evaluator(SIMPLE_SCHEMA, NotNull("data"), case_sensitive=True)
-    assert evaluate(Record(2, "a"))
-    assert not evaluate(Record(3, None))
+    assert evaluate(_record_simple(2, "a"))
+    assert not evaluate(_record_simple(3, None))
 
 
 def test_is_nan() -> None:
     evaluate = expression_evaluator(FLOAT_SCHEMA, IsNaN("f"), case_sensitive=True)
-    assert not evaluate(Record(2, 0.0))
-    assert not evaluate(Record(3, float("infinity")))
-    assert evaluate(Record(4, float("nan")))
+    assert not evaluate(_record_float(2, f=0.0))
+    assert not evaluate(_record_float(3, f=float("infinity")))
```
[GitHub] [iceberg] jackye1995 commented on pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure
jackye1995 commented on PR #6576: URL: https://github.com/apache/iceberg/pull/6576#issuecomment-1382243877

@xiaoxuandev can you take a look? I believe this case should have been covered in unit test, need to take a deeper look into it
[GitHub] [iceberg] huaxingao opened a new pull request, #6582: Add a Spark procedure to collect NDV
huaxingao opened a new pull request, #6582: URL: https://github.com/apache/iceberg/pull/6582

Add a Spark procedure to collect NDV, which will be used for CBO.
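Assuming it is registered like the existing Spark procedures, invocation would presumably look like the following; the procedure name `distinct_count` and the catalog/table names are assumptions (the `table`/`columns` parameters appear in the code quoted later in this thread):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical call -- the registered procedure name is not shown in this thread.
spark.sql(
    "CALL my_catalog.system.distinct_count("
    "table => 'db.tbl', columns => array('id', 'data'))"
).show()
```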
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1069901978

## core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:

```diff
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";
```

Review Comment: Spark doesn't use Apache DataSketches to collect approximate NDV, so I am adding a new blob type. Hope this is OK.
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1069914737

## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:

```diff
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import static java.lang.String.format;
+import static java.util.UUID.randomUUID;
+import static org.apache.hadoop.shaded.com.google.common.collect.ImmutableList.toImmutableList;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the requested columns
+ * and sets this to the table's StatisticsFile.
+ */
```

Review Comment: I am debating with myself whether I should collect NDV only, or also collect everything else such as max, min, num_nulls, etc. in ANALYZE TABLE. I will just collect NDV for now.
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1069918267

## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:

```diff
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import static java.lang.String.format;
+import static java.util.UUID.randomUUID;
+import static org.apache.hadoop.shaded.com.google.common.collect.ImmutableList.toImmutableList;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the requested columns
+ * and sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+      new ProcedureParameter[] {
+        ProcedureParameter.required("table", DataTypes.StringType),
+        ProcedureParameter.optional("distinct_count_view", DataTypes.StringType),
+        ProcedureParameter.optional("columns", STRING_ARRAY),
+      };
+
+  private static final StructType OUTPUT_TYPE =
+      new StructType(
+          new StructField[] {
+            new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
+          });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+    return new Builder() {
+      @Override
+      protected DistinctCountProcedure doBuild() {
+        return new DistinctCountProcedure(tableCatalog());
+      }
+    };
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+    super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+    return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+    return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+    String tableName = args.getString(0);
+    Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+    SparkTable sparkTable = loadSparkTable(tableIdent);
+    StructType schema = sparkTable.schema();
+    Table table = sparkTable.table();
+    ArrayData columns = args.getArray(2);
+    int columnSizes = columns.numElements();
+
+    long[] ndvs = new long[columnSizes];
+    int[] fieldId = new int[columnSizes];
+    String query = "SELECT ";
+    for (int i = 0; i < columnSizes; i++) {
+      String colName = columns.getUTF8String(i).toString();
+      query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+      fieldId[i] = schema.fieldIndex(colName);
+    }
+
+    query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+    Dataset<Row> df = spark().sql(query);
```
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
jackye1995 commented on code in PR #6428: URL: https://github.com/apache/iceberg/pull/6428#discussion_r1069937682

## snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableMetadata.java:

```diff
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.snowflake;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.iceberg.relocated.com.google.common.base.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+class SnowflakeTableMetadata {
+  public static final Pattern SNOWFLAKE_AZURE_PATTERN =
+      Pattern.compile("azure://([^/]+)/([^/]+)/(.*)");
+
+  private final String snowflakeMetadataLocation;
+  private final String icebergMetadataLocation;
+  private final String status;
+
+  // Note: Since not all sources will necessarily come from a raw JSON representation, this raw
+  // JSON should only be considered a convenient debugging field. Equality of two
+  // SnowflakeTableMetadata instances should not depend on equality of this field.
+  private final String rawJsonVal;
+
+  SnowflakeTableMetadata(
+      String snowflakeMetadataLocation,
+      String icebergMetadataLocation,
+      String status,
+      String rawJsonVal) {
+    this.snowflakeMetadataLocation = snowflakeMetadataLocation;
+    this.icebergMetadataLocation = icebergMetadataLocation;
+    this.status = status;
+    this.rawJsonVal = rawJsonVal;
+  }
+
+  /** Storage location of table metadata in Snowflake's path syntax. */
+  public String snowflakeMetadataLocation() {
+    return snowflakeMetadataLocation;
+  }
+
+  /** Storage location of table metadata in Iceberg's path syntax. */
+  public String icebergMetadataLocation() {
+    return icebergMetadataLocation;
+  }
+
+  public String getStatus() {
+    return status;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+    if (this == o) {
+      return true;
+    } else if (!(o instanceof SnowflakeTableMetadata)) {
+      return false;
+    }
+
+    // Only consider parsed fields, not the raw JSON that may or may not be the original source of
+    // this instance.
+    SnowflakeTableMetadata that = (SnowflakeTableMetadata) o;
+    return Objects.equal(this.snowflakeMetadataLocation, that.snowflakeMetadataLocation)
+        && Objects.equal(this.icebergMetadataLocation, that.icebergMetadataLocation)
+        && Objects.equal(this.status, that.status);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hashCode(snowflakeMetadataLocation, icebergMetadataLocation, status);
+  }
+
+  @Override
+  public String toString() {
+    return String.format(
+        "snowflakeMetadataLocation: '%s', icebergMetadataLocation: '%s', status: '%s'",
+        snowflakeMetadataLocation, icebergMetadataLocation, status);
+  }
+
+  public String toDebugString() {
+    return String.format("%s, rawJsonVal: %s", toString(), rawJsonVal);
+  }
+
+  /**
+   * Translates from Snowflake's path syntax to Iceberg's path syntax for paths matching known
+   * non-compatible Snowflake paths. Throws IllegalArgumentException if the prefix of the
+   * snowflakeLocation is a known non-compatible path syntax but fails to match the expected path
+   * components for a successful translation.
+   */
+  public static String snowflakeLocationToIcebergLocation(String snowflakeLocation) {
```

Review Comment: Thanks for the clarification, I approved the changes. Very excited to see this from Snowflake.

> I've been meaning to open a discussion to see if anyone has thought about maybe adding some ease-of-use hooks for last-mile automatic basic path translations right before they go to FileIO resolution.

For HadoopFileIO, I think this is not really needed, because we typically see our users already configure HDFS settings to map schemes to whatever file system implementations they would like to use. I think ResolvingFileIO already does this kind of translation to some extent; maybe we can extend its functionality on that front.
[GitHub] [iceberg] JonasJ-ap opened a new pull request, #6179: AWS: Re-tag files when renaming tables in GlueCatalog
JonasJ-ap opened a new pull request, #6179: URL: https://github.com/apache/iceberg/pull/6179

Follows PR #4402. As mentioned in https://github.com/apache/iceberg/pull/4402#issuecomment-1261096282: in `GlueCatalog`, if `s3.write.table-name-tag-enabled` and `s3.write.namespace-name-tag-enabled` are enabled, all the files related to a table are tagged with the table name and the namespace name. We need to update these tags when we rename the table. This PR adds S3 tag updates for the `renameTable` operation in `GlueCatalog`.
[GitHub] [iceberg] jackye1995 commented on pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog
jackye1995 commented on PR #6179: URL: https://github.com/apache/iceberg/pull/6179#issuecomment-1382318184

I'd like to discuss this a bit more, since we do have some actual customer use cases for this: the S3 tagging-related features in Iceberg integrate very well with S3 lifecycle policies and bucket permissions, and RENAME seems to be the only piece missing for the whole end-to-end flow to work.

> I think that if a rename requires anything but a metadata operation, then the Iceberg catalog should not allow it and should throw an exception that rename is not supported.

Totally agree. The reason I was relatively open to this approach was that RENAME is not a frequent operation for most users, and they are willing to wait longer to complete the full rename if it can update related tags or other metadata information in storage.

> Users aren't going to know that this needs to perform a potentially huge number of S3 operations.

Correct. Maybe one way to resolve this is to throw an exception by default, so the user has to turn on the feature explicitly and is aware of the implications. This logic will probably exist outside the Iceberg repository if it is not contributed in. @rdblue please let me know if there is any intermediate ground that could be taken for this use case.
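For reference, the tagging behavior this rename discussion builds on is enabled per catalog with the two properties named above (a sketch of Spark configuration for a Glue catalog; the catalog name and warehouse path are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
    # Written files get tagged with the table and namespace names; these are
    # the tags a renameTable implementation would need to rewrite.
    .config("spark.sql.catalog.glue.s3.write.table-name-tag-enabled", "true")
    .config("spark.sql.catalog.glue.s3.write.namespace-name-tag-enabled", "true")
    .getOrCreate()
)
```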
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
RussellSpitzer commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070017872

## core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:

```diff
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";
```

Review Comment: @findepi What are you using for NDV stats here? I figure we should have a common blob type
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
RussellSpitzer commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070020289

## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:

```diff
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the requested columns and
+ * sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+      new ProcedureParameter[] {
+        ProcedureParameter.required("table", DataTypes.StringType),
+        ProcedureParameter.optional("distinct_count_view", DataTypes.StringType),
+        ProcedureParameter.optional("columns", STRING_ARRAY),
+      };
+
+  private static final StructType OUTPUT_TYPE =
+      new StructType(
+          new StructField[] {
+            new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
+          });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+    return new Builder() {
+      @Override
+      protected DistinctCountProcedure doBuild() {
+        return new DistinctCountProcedure(tableCatalog());
+      }
+    };
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+    super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+    return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+    return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+    String tableName = args.getString(0);
+    Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+    SparkTable sparkTable = loadSparkTable(tableIdent);
+    StructType schema = sparkTable.schema();
+    Table table = sparkTable.table();
+    ArrayData columns = args.getArray(2);
+    int columnSizes = columns.numElements();
+
+    long[] ndvs = new long[columnSizes];
+    int[] fieldId = new int[columnSizes];
+    String query = "SELECT ";
+    for (int i = 0; i < columnSizes; i++) {
+      String colName = columns.getUTF8String(i).toString();
+      query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
```

Review Comment: Since we are technically not using distinct here, maybe we should be calling the procedure "analyze"?
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
RussellSpitzer commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070023914

## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java:

```diff
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.UUID;
+import org.apache.iceberg.GenericBlobMetadata;
+import org.apache.iceberg.GenericStatisticsFile;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.puffin.Blob;
+import org.apache.iceberg.puffin.Puffin;
+import org.apache.iceberg.puffin.PuffinWriter;
+import org.apache.iceberg.puffin.StandardBlobTypes;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.source.SparkTable;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A procedure that gets approximate NDV (number of distinct value) for the requested columns and
+ * sets this to the table's StatisticsFile.
+ */
+public class DistinctCountProcedure extends BaseProcedure {
+  private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+      new ProcedureParameter[] {
+        ProcedureParameter.required("table", DataTypes.StringType),
+        ProcedureParameter.optional("distinct_count_view", DataTypes.StringType),
+        ProcedureParameter.optional("columns", STRING_ARRAY),
+      };
+
+  private static final StructType OUTPUT_TYPE =
+      new StructType(
+          new StructField[] {
+            new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
+          });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+    return new Builder() {
+      @Override
+      protected DistinctCountProcedure doBuild() {
+        return new DistinctCountProcedure(tableCatalog());
+      }
+    };
+  }
+
+  private DistinctCountProcedure(TableCatalog tableCatalog) {
+    super(tableCatalog);
+  }
+
+  @Override
+  public ProcedureParameter[] parameters() {
+    return PARAMETERS;
+  }
+
+  @Override
+  public StructType outputType() {
+    return OUTPUT_TYPE;
+  }
+
+  @Override
+  public InternalRow[] call(InternalRow args) {
+    String tableName = args.getString(0);
+    Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name());
+    SparkTable sparkTable = loadSparkTable(tableIdent);
+    StructType schema = sparkTable.schema();
+    Table table = sparkTable.table();
+    ArrayData columns = args.getArray(2);
+    int columnSizes = columns.numElements();
+
+    long[] ndvs = new long[columnSizes];
+    int[] fieldId = new int[columnSizes];
+    String query = "SELECT ";
+    for (int i = 0; i < columnSizes; i++) {
+      String colName = columns.getUTF8String(i).toString();
+      query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
+      fieldId[i] = schema.fieldIndex(colName);
+    }
+
+    query = query.substring(0, query.length() - 2) + " FROM " + tableName;
+    Dataset<Row> df = spark().sql(query);
+
+    for (int i = 0; i < columnSizes; i++) {
+      ndvs[i] = df.head().getLong(i);
+    }
+
+    TableOperations operatio
```
[GitHub] [iceberg] RussellSpitzer commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file
RussellSpitzer commented on PR #6461: URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1382374860 I'm still a little worried about saving this information: what does knowing the sort order mean for a file? Are we guaranteeing that the file is locally sorted by that order, or globally sorted along with the files added in that snapshot? I'm cautious about doing anything here until we really define what it means. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] pvary opened a new pull request, #6583: Flink: Refactor sink tests to use HadoopCatalogResource
pvary opened a new pull request, #6583: URL: https://github.com/apache/iceberg/pull/6583 Refactor Flink Sink tests to use the HadoopCatalogResource. This is groundwork for adding encryption tests for Flink sources and sinks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] pvary commented on pull request #6583: Flink: Refactor sink tests to use HadoopCatalogResource
pvary commented on PR #6583: URL: https://github.com/apache/iceberg/pull/6583#issuecomment-1382415642 CC: @hililiwei, @ggershinsky -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6581: Spark 3.3: Add RemoveDanglingDeletes action
szehon-ho commented on code in PR #6581: URL: https://github.com/apache/iceberg/pull/6581#discussion_r1070090624 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RemoveDanglingDeletesSparkAction.java: ## @@ -0,0 +1,227 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.actions; + +import static org.apache.iceberg.MetadataTableType.ENTRIES; +import static org.apache.spark.sql.functions.min; + +import java.util.List; +import java.util.Map; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.DeleteFiles; +import org.apache.iceberg.FileMetadata; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Partitioning; +import org.apache.iceberg.Table; +import org.apache.iceberg.actions.RemoveDanglingDeleteFiles; +import org.apache.iceberg.actions.RemoveDanglingDeleteFilesActionResult; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.spark.JobGroupInfo; +import org.apache.iceberg.types.Types; +import org.apache.spark.api.java.function.MapFunction; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Encoders; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SparkSession; +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema; + +/** + * An action that removes dangling delete files from the current snapshot. A delete file is dangling + * if its deletes no longer applies to any data file. 
+ * + * The following dangling delete files are removed: + * + * + * Position delete files with a sequence number less than that of any data file in the same + * partition + * Equality delete files with a sequence number less than or equal to that of any data file in + * the same partition + * + */ +public class RemoveDanglingDeletesSparkAction +extends BaseSnapshotUpdateSparkAction +implements RemoveDanglingDeleteFiles { + + private final Table table; + + protected RemoveDanglingDeletesSparkAction(SparkSession spark, Table table) { +super(spark); +this.table = table; + } + + @Override + protected RemoveDanglingDeletesSparkAction self() { +return this; + } + + @Override + public Result execute() { +if (table.specs().size() == 1 && table.spec().isUnpartitioned()) { + // ManifestFilterManager already performs this table-wide on each commit + return RemoveDanglingDeleteFilesActionResult.empty(); +} + +String desc = String.format("Remove dangling delete files for %s", table.name()); +JobGroupInfo info = newJobGroupInfo("REWRITE-MANIFESTS", desc); +return withJobGroupInfo(info, this::doExecute); + } + + private Result doExecute() { +Dataset entries = +loadMetadataTable(table, ENTRIES) +.filter("status < 2") // live entries +.selectExpr( +"data_file.partition as partition", +"data_file.spec_id as spec_id", +"data_file.file_path as file_path", +"data_file.content as content", +"data_file.file_size_in_bytes as file_size_in_bytes", +"data_file.record_count as record_count", +"sequence_number"); + +DeleteFiles deleteFiles = table.newDelete(); +List toRemove = withReusableDS(entries, this::danglingDeletes); +toRemove.forEach(deleteFiles::deleteFile); +deleteFiles.commit(); + +return new RemoveDanglingDeleteFilesActionResult(toRemove); + } + + /** + * Calculate dangling delete files + * + * + * Group all files by partition, calculate the minimum data file sequence number in each. + * For each partition, check if any position delete files have a sequence number less than + * the partition's min_data_sequence_number + * For each partition, check if any equality delete files have a sequence number less than + * or equal to the partition's min_data_sequence_number + * + * + * @param entries dataset of file entries, marked by content (0 for data, 1 for posDeletes, 2 for + * eqDeletes) + * @return list of dangling delete f
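To make the two rules above concrete, here is a minimal sketch of the per-file check, assuming the content codes described in the message (0 = data, 1 = position deletes, 2 = equality deletes); the helper name and signature are hypothetical, not part of the PR.

```java
public class DanglingDeleteCheck {
  // Hypothetical helper mirroring the rules above: position delete files are
  // dangling when strictly below the partition's minimum data sequence number;
  // equality delete files are dangling when at or below that minimum.
  static boolean isDangling(int content, long deleteSeq, long minDataSeq) {
    if (content == 1) { // position deletes
      return deleteSeq < minDataSeq;
    } else if (content == 2) { // equality deletes
      return deleteSeq <= minDataSeq;
    }
    return false; // content == 0 is a data file, never a dangling delete
  }
}
```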
[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine
huaxingao commented on issue #6549: URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1382430906 Here is the [PR](https://github.com/apache/iceberg/pull/6582) for implementing a Spark stored procedure to collect NDV. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070099281 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java: ## @@ -0,0 +1,188 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.ByteBuffer; +import java.util.UUID; +import org.apache.iceberg.GenericBlobMetadata; +import org.apache.iceberg.GenericStatisticsFile; +import org.apache.iceberg.HasTableOperations; +import org.apache.iceberg.Table; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.puffin.Blob; +import org.apache.iceberg.puffin.Puffin; +import org.apache.iceberg.puffin.PuffinWriter; +import org.apache.iceberg.puffin.StandardBlobTypes; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.spark.source.SparkTable; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * A procedure that gets approximate NDV (number of distinct value) for the requested columns and + * sets this to the table's StatisticsFile. 
+ */ +public class DistinctCountProcedure extends BaseProcedure { + private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class); + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +ProcedureParameter.optional("distinct_count_view", DataTypes.StringType), +ProcedureParameter.optional("columns", STRING_ARRAY), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + public static SparkProcedures.ProcedureBuilder builder() { +return new Builder() { + @Override + protected DistinctCountProcedure doBuild() { +return new DistinctCountProcedure(tableCatalog()); + } +}; + } + + private DistinctCountProcedure(TableCatalog tableCatalog) { +super(tableCatalog); + } + + @Override + public ProcedureParameter[] parameters() { +return PARAMETERS; + } + + @Override + public StructType outputType() { +return OUTPUT_TYPE; + } + + @Override + public InternalRow[] call(InternalRow args) { +String tableName = args.getString(0); +Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name()); +SparkTable sparkTable = loadSparkTable(tableIdent); +StructType schema = sparkTable.schema(); +Table table = sparkTable.table(); +ArrayData columns = args.getArray(2); +int columnSizes = columns.numElements(); + +long[] ndvs = new long[columnSizes]; +int[] fieldId = new int[columnSizes]; +String query = "SELECT "; +for (int i = 0; i < columnSizes; i++) { + String colName = columns.getUTF8String(i).toString(); + query += "APPROX_COUNT_DISTINCT(" + colName + "), "; Review Comment: I will change the procedure name from `DistinctCountProcedure` to `AnalyzeTableProcedure` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-m
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070101248 ## core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java: ## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +class ViewHistoryEntryParser { + + private ViewHistoryEntryParser() {} + + static final String VERSION_ID = "version-id"; + static final String TIMESTAMP_MS = "timestamp-ms"; + + static String toJson(ViewHistoryEntry entry) { +return JsonUtil.generate(gen -> toJson(entry, gen), false); + } + + static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws IOException { +generator.writeStartObject(); +generator.writeNumberField(TIMESTAMP_MS, entry.timestampMillis()); +generator.writeNumberField(VERSION_ID, entry.versionId()); +generator.writeEndObject(); + } + + static ViewHistoryEntry fromJson(String json) { +Preconditions.checkArgument( +json != null && !json.isEmpty(), +"Cannot parse view history entry from invalid JSON: %s", +json); +return JsonUtil.parse(json, ViewHistoryEntryParser::fromJson); + } + + static ViewHistoryEntry fromJson(JsonNode node) { +return new BaseViewHistoryEntry( Review Comment: Separated the precondition checks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
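For reference, the parser above round-trips JSON of the following shape; the values are made up, and the snippet assumes package-private access to the parser, as the test class has.

```java
// Shape of the JSON handled by ViewHistoryEntryParser; values are made up.
String json = "{\"timestamp-ms\": 1673900000000, \"version-id\": 1}";
ViewHistoryEntry entry = ViewHistoryEntryParser.fromJson(json);
System.out.println(entry.timestampMillis()); // 1673900000000
System.out.println(entry.versionId());       // 1
System.out.println(ViewHistoryEntryParser.toJson(entry)); // same fields back out
```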
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070101425 ## core/src/test/java/org/apache/iceberg/view/TestViewHistoryEntryParser.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import org.junit.Assert; +import org.junit.Test; + +public class TestViewHistoryEntryParser { + + @Test Review Comment: Good point, I've added more tests around null, empty, and missing fields. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070101669 ## core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java: ## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +class ViewHistoryEntryParser { + + private ViewHistoryEntryParser() {} + + static final String VERSION_ID = "version-id"; + static final String TIMESTAMP_MS = "timestamp-ms"; + + static String toJson(ViewHistoryEntry entry) { +return JsonUtil.generate(gen -> toJson(entry, gen), false); + } + + static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws IOException { +generator.writeStartObject(); Review Comment: Sure, updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6559: Core: View core parser implementations
amogh-jahagirdar commented on PR #6559: URL: https://github.com/apache/iceberg/pull/6559#issuecomment-1382437507 Thanks @nastra, I'll be taking these suggestions in all the split PRs I'm raising. Agreed, more tests on nullability/missing fields are needed, and now that we use the Immutables dependency in the metrics implementation, we have a good precedent to use it here as well, which will simplify a lot of the boilerplate code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar closed pull request #6559: Core: View core parser implementations
amogh-jahagirdar closed pull request #6559: Core: View core parser implementations URL: https://github.com/apache/iceberg/pull/6559 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6565: Core: View history entry core implementation
rdblue commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070122674 ## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import org.immutables.value.Value; + +@Value.Immutable +public interface BaseViewHistoryEntry extends ViewHistoryEntry { Review Comment: I think we can drop `Base` since this isn't a base class. We use that when we expect something to extend the class. But this is an interface that will result in an `ImmutableViewHistoryEntry`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #6565: Core: View history entry core implementation
rdblue commented on PR #6565: URL: https://github.com/apache/iceberg/pull/6565#issuecomment-1382456468 Looks good other than the name of the history entry interface. Thanks, @amogh-jahagirdar! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] flyrain commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
flyrain commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070140194 ## core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java: ## @@ -26,4 +26,6 @@ private StandardBlobTypes() {} * href="https://datasketches.apache.org/">Apache DataSketches library */ public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1"; + + public static final String NDV_BLOB = "ndv-blob"; Review Comment: We should ideally use one blob type for NDV, although Spark doesn't have the sketch data. I'm also curious how sketch data is useful for a table-level metric. It is absolutely useful at the file and partition level, since we can merge sketches later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
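On mergeability: this is what the theta sketch blob type enables. A sketch kept per file or partition can be unioned into a table-level NDV estimate without rescanning data. Below is a sketch of the idea using the Apache DataSketches Java library; method names are from recent datasketches-java releases and may differ across versions.

```java
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaMergeExample {
  public static void main(String[] args) {
    // Pretend these sketches were built while writing two different data files.
    UpdateSketch file1 = UpdateSketch.builder().build();
    UpdateSketch file2 = UpdateSketch.builder().build();
    for (long i = 0; i < 1000; i++) {
      file1.update(i);
    }
    for (long i = 500; i < 1500; i++) {
      file2.update(i); // overlaps file1 on [500, 1000)
    }

    // Merging the file-level sketches gives a table-level NDV estimate.
    Union union = SetOperation.builder().buildUnion();
    union.union(file1);
    union.union(file2);
    CompactSketch merged = union.getResult();
    System.out.println(merged.getEstimate()); // ~1500 distinct values
  }
}
```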
[GitHub] [iceberg] stevenzwu opened a new pull request, #6584: Flink: support reading as Avro GenericRecord for FLIP-27 IcebergSource
stevenzwu opened a new pull request, #6584: URL: https://github.com/apache/iceberg/pull/6584 cc @hililiwei -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
szehon-ho commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070150005 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java: ## @@ -0,0 +1,188 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.ByteBuffer; +import java.util.UUID; +import org.apache.iceberg.GenericBlobMetadata; +import org.apache.iceberg.GenericStatisticsFile; +import org.apache.iceberg.HasTableOperations; +import org.apache.iceberg.Table; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.puffin.Blob; +import org.apache.iceberg.puffin.Puffin; +import org.apache.iceberg.puffin.PuffinWriter; +import org.apache.iceberg.puffin.StandardBlobTypes; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.spark.source.SparkTable; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * A procedure that gets approximate NDV (number of distinct value) for the requested columns and + * sets this to the table's StatisticsFile. 
+ */ +public class DistinctCountProcedure extends BaseProcedure { + private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class); + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +ProcedureParameter.optional("distinct_count_view", DataTypes.StringType), +ProcedureParameter.optional("columns", STRING_ARRAY), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + public static SparkProcedures.ProcedureBuilder builder() { +return new Builder() { + @Override + protected DistinctCountProcedure doBuild() { +return new DistinctCountProcedure(tableCatalog()); + } +}; + } + + private DistinctCountProcedure(TableCatalog tableCatalog) { +super(tableCatalog); + } + + @Override + public ProcedureParameter[] parameters() { +return PARAMETERS; + } + + @Override + public StructType outputType() { +return OUTPUT_TYPE; + } + + @Override + public InternalRow[] call(InternalRow args) { +String tableName = args.getString(0); +Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name()); +SparkTable sparkTable = loadSparkTable(tableIdent); +StructType schema = sparkTable.schema(); +Table table = sparkTable.table(); +ArrayData columns = args.getArray(2); +int columnSizes = columns.numElements(); + +long[] ndvs = new long[columnSizes]; +int[] fieldId = new int[columnSizes]; +String query = "SELECT "; +for (int i = 0; i < columnSizes; i++) { + String colName = columns.getUTF8String(i).toString(); + query += "APPROX_COUNT_DISTINCT(" + colName + "), "; + fieldId[i] = schema.fieldIndex(colName); +} + +query = query.substring(0, query.length() - 2) + " FROM " + tableName; +Dataset df = spark().sql(query); + +for (int i = 0; i < columnSizes; i++) { + ndvs[i] = df.head().getLong(i); +} + +TableOperations operations =
[GitHub] [iceberg] rdblue commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
rdblue commented on code in PR #6428: URL: https://github.com/apache/iceberg/pull/6428#discussion_r1070158086 ## snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java: ## @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.snowflake; + +import java.io.Closeable; +import java.io.IOException; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.CatalogProperties; +import org.apache.iceberg.CatalogUtil; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.SupportsNamespaces; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.NoSuchNamespaceException; +import org.apache.iceberg.hadoop.Configurable; +import org.apache.iceberg.io.CloseableGroup; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.jdbc.JdbcClientPool; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class SnowflakeCatalog extends BaseMetastoreCatalog +implements Closeable, SupportsNamespaces, Configurable { + private static final String DEFAULT_CATALOG_NAME = "snowflake_catalog"; + private static final String DEFAULT_FILE_IO_IMPL = "org.apache.iceberg.io.ResolvingFileIO"; + + static class FileIOFactory { +public FileIO newFileIO(String impl, Map properties, Object hadoopConf) { + return CatalogUtil.loadFileIO(impl, properties, hadoopConf); Review Comment: This factory seems odd since it has no state that is passed through to the `loadFileIO` method. Couldn't this just call `loadFileIO` directly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs
amogh-jahagirdar commented on code in PR #6573: URL: https://github.com/apache/iceberg/pull/6573#discussion_r1070158484 ## docs/spark-queries.md: ## @@ -126,6 +126,8 @@ To select a specific table snapshot or the snapshot at some time in the DataFram * `snapshot-id` selects a specific table snapshot * `as-of-timestamp` selects the current snapshot at a timestamp, in milliseconds +* `branch` selects the head snapshot of the specified branch. Note that currently branch cannot be combined with as-of-timestamp. +* `tag` selects the snapshot associated with the specified tag Review Comment: Definitely agree on having a SQL example once #6575 gets merged. For combining as-of-timestamp with tag, I felt that was apparent since a tag can only map to a single snapshot, which conflicts with passing in a timestamp, whereas branch + time travel is a different case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
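As a companion to the doc change above, the DataFrame read options look like this; the table and ref names are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BranchTagReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Read the head snapshot of a branch; "audit-branch" is a placeholder.
    Dataset<Row> branchDf =
        spark.read().format("iceberg").option("branch", "audit-branch").load("db.table");

    // Read the snapshot the tag points to; "v1.0" is a placeholder.
    Dataset<Row> tagDf =
        spark.read().format("iceberg").option("tag", "v1.0").load("db.table");
  }
}
```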
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070149483 ## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import org.immutables.value.Value; + +@Value.Immutable +public interface BaseViewHistoryEntry extends ViewHistoryEntry { Review Comment: Sure, let me see about a better name; the underlying interface is already ViewHistoryEntry. I would add to the underlying interface directly, but then the implementation's class path moves into API, which goes against the existing pattern. ViewLogEntry comes to mind. This is analogous to SnapshotLogEntry implementing HistoryEntry for tables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation
jackye1995 commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070159167 ## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import org.immutables.value.Value; + +@Value.Immutable +public interface BaseViewHistoryEntry extends ViewHistoryEntry { Review Comment: maybe a dumb question, but why don't we just add the immutable annotation on the interface in that case? ``` @Value.Immutable public interface ViewLogEntry extends ViewHistoryEntry { long timestampMillis(); int versionId(); } ``` ``` public interface ViewHistoryEntry { /** Return the timestamp in milliseconds of the change */ long timestampMillis(); /** Return ID of the new current version */ int versionId(); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1070161504 ## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.view; + +import org.immutables.value.Value; + +@Value.Immutable +public interface BaseViewHistoryEntry extends ViewHistoryEntry { Review Comment: Not a dumb question, nvm on the classpath issue I mentioned, I confused myself. We can just mark the base interface as immutable and eliminate the duplicate interface. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
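For anyone following along: once the interface carries `@Value.Immutable`, the Immutables processor generates a builder-style implementation. A minimal sketch, assuming the conventional generated name (the processor prefixes the annotated type with `Immutable`) and made-up values:

```java
// Sketch of using the generated class; ImmutableViewHistoryEntry is the
// conventional name Immutables generates, and the values are made up.
ViewHistoryEntry entry =
    ImmutableViewHistoryEntry.builder()
        .timestampMillis(1673900000000L)
        .versionId(1)
        .build();
```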
[GitHub] [iceberg] dennishuo commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
dennishuo commented on code in PR #6428: URL: https://github.com/apache/iceberg/pull/6428#discussion_r1070164244 ## snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java: ## @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.snowflake; + +import java.io.Closeable; +import java.io.IOException; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.CatalogProperties; +import org.apache.iceberg.CatalogUtil; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.SupportsNamespaces; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.NoSuchNamespaceException; +import org.apache.iceberg.hadoop.Configurable; +import org.apache.iceberg.io.CloseableGroup; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.jdbc.JdbcClientPool; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class SnowflakeCatalog extends BaseMetastoreCatalog +implements Closeable, SupportsNamespaces, Configurable { + private static final String DEFAULT_CATALOG_NAME = "snowflake_catalog"; + private static final String DEFAULT_FILE_IO_IMPL = "org.apache.iceberg.io.ResolvingFileIO"; + + static class FileIOFactory { +public FileIO newFileIO(String impl, Map properties, Object hadoopConf) { + return CatalogUtil.loadFileIO(impl, properties, hadoopConf); Review Comment: This was just extracted to preserve the flow of unit test setup: it lets tests return the pre-configured InMemoryFileIO fake instance rather than having dynamic classloading produce an empty one. The alternative would've been to introduce some static global state in InMemoryFileIO, but that seems prone to causing cross-test-case problems. I could add a comment or annotation here to explain its existence if that'd make it cleaner. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
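A rough illustration of the test pattern being described, assuming the factory is injectable in tests; the test-double class and its wiring are assumptions, with only `FileIOFactory.newFileIO` coming from the diff above.

```java
import java.util.Map;
import org.apache.iceberg.io.FileIO;

// Hypothetical test double: returns a pre-configured fake FileIO instead of
// letting CatalogUtil.loadFileIO instantiate one by class name.
class FakeFileIOFactory extends SnowflakeCatalog.FileIOFactory {
  private final FileIO fake;

  FakeFileIOFactory(FileIO fake) {
    this.fake = fake;
  }

  @Override
  public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
    return fake; // bypass dynamic classloading in tests
  }
}
```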
[GitHub] [iceberg] dennishuo commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
dennishuo commented on code in PR #6428: URL: https://github.com/apache/iceberg/pull/6428#discussion_r1070166470 ## snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java: ## @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.snowflake; + +import java.io.Closeable; +import java.io.IOException; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.CatalogProperties; +import org.apache.iceberg.CatalogUtil; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.SupportsNamespaces; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.NoSuchNamespaceException; +import org.apache.iceberg.hadoop.Configurable; +import org.apache.iceberg.io.CloseableGroup; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.jdbc.JdbcClientPool; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class SnowflakeCatalog extends BaseMetastoreCatalog +implements Closeable, SupportsNamespaces, Configurable { + private static final String DEFAULT_CATALOG_NAME = "snowflake_catalog"; + private static final String DEFAULT_FILE_IO_IMPL = "org.apache.iceberg.io.ResolvingFileIO"; + + static class FileIOFactory { +public FileIO newFileIO(String impl, Map properties, Object hadoopConf) { + return CatalogUtil.loadFileIO(impl, properties, hadoopConf); Review Comment: Added code comment to clarify -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] williamhyun opened a new pull request, #6585: Update ORC to 1.8.2
williamhyun opened a new pull request, #6585: URL: https://github.com/apache/iceberg/pull/6585 Apache ORC 1.8.2 is the latest ORC release; it brings the changes and bug fixes listed below, including an SBOM. - https://github.com/apache/orc/releases/tag/v1.8.2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] commented on issue #5183: Allow to configure Avro block size
github-actions[bot] commented on issue #5183: URL: https://github.com/apache/iceberg/issues/5183#issuecomment-1382598801 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] commented on issue #4607: [Docs] Create an item list for re-organizing docs to the proposed layout
github-actions[bot] commented on issue #4607: URL: https://github.com/apache/iceberg/issues/4607#issuecomment-1382598842 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] closed issue #5183: Allow to configure Avro block size
github-actions[bot] closed issue #5183: Allow to configure Avro block size URL: https://github.com/apache/iceberg/issues/5183 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] dramaticlly commented on a diff in pull request #6574: Python: Raise exception on deletes
dramaticlly commented on code in PR #6574: URL: https://github.com/apache/iceberg/pull/6574#discussion_r1070176238 ## python/pyiceberg/table/__init__.py: ## @@ -341,7 +346,18 @@ def plan_files(self) -> Iterator[FileScanTask]: all_files = files(io.new_input(manifest.manifest_path)) matching_partition_files = filter(partition_filter, all_files) -yield from (FileScanTask(file) for file in matching_partition_files) +def _check_content(file: ManifestFile) -> ManifestFile: +try: +if file.content == ManifestContent.DELETES: +raise ValueError("PyIceberg does not support deletes: https://github.com/apache/iceberg/issues/6568") Review Comment: Shall we add a test to cover this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070181796 ## core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java: ## @@ -26,4 +26,6 @@ private StandardBlobTypes() {} * href="https://datasketches.apache.org/">Apache DataSketches library */ public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1"; + + public static final String NDV_BLOB = "ndv-blob"; Review Comment: @szehon-ho Right, we should have a better name for this. I am not sure whether we can have a common blob type here; I will wait for @findepi's input before changing this one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
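[Editor's note] For readers unfamiliar with Puffin, a rough sketch of how a stats blob is written, using the existing `apache-datasketches-theta-v1` type since the new `NDV_BLOB` name is still under discussion above; `outputFile`, `snapshotId`, `sequenceNumber`, and `sketchBytes` are placeholders, and the exact builder calls are stated to the best of my knowledge of the Puffin API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import org.apache.iceberg.puffin.Blob;
import org.apache.iceberg.puffin.Puffin;
import org.apache.iceberg.puffin.PuffinWriter;
import org.apache.iceberg.puffin.StandardBlobTypes;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;

// Writes one NDV sketch blob for the column with field ID 1 into a Puffin file.
try (PuffinWriter writer = Puffin.write(outputFile).createdBy("ndv-procedure").build()) {
  writer.add(
      new Blob(
          StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1,
          ImmutableList.of(1),            // source column field IDs
          snapshotId,                      // snapshot the stats were computed from
          sequenceNumber,                  // sequence number of that snapshot
          ByteBuffer.wrap(sketchBytes)));  // serialized theta sketch
  writer.finish();
} catch (IOException e) {
  throw new UncheckedIOException(e);
}
```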
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070181906 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java: ## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import static java.lang.String.format; +import static java.util.UUID.randomUUID; +import static org.apache.hadoop.shaded.com.google.common.collect.ImmutableList.toImmutableList; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.ByteBuffer; +import org.apache.iceberg.GenericBlobMetadata; +import org.apache.iceberg.GenericStatisticsFile; +import org.apache.iceberg.HasTableOperations; +import org.apache.iceberg.Table; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.puffin.Blob; +import org.apache.iceberg.puffin.Puffin; +import org.apache.iceberg.puffin.PuffinWriter; +import org.apache.iceberg.puffin.StandardBlobTypes; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.spark.source.SparkTable; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * A procedure that gets approximate NDV (number of distinct value) for the requested columns + * and sets this to the table's StatisticsFile. + */ Review Comment: We have file-level metadata for max, min, num_nulls, etc. That's why I was hesitant to include those here. We don't have file-level NDV, though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070181959 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java: ## @@ -0,0 +1,188 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.ByteBuffer; +import java.util.UUID; +import org.apache.iceberg.GenericBlobMetadata; +import org.apache.iceberg.GenericStatisticsFile; +import org.apache.iceberg.HasTableOperations; +import org.apache.iceberg.Table; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.puffin.Blob; +import org.apache.iceberg.puffin.Puffin; +import org.apache.iceberg.puffin.PuffinWriter; +import org.apache.iceberg.puffin.StandardBlobTypes; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.spark.source.SparkTable; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * A procedure that gets approximate NDV (number of distinct value) for the requested columns and + * sets this to the table's StatisticsFile. 
+ */ +public class DistinctCountProcedure extends BaseProcedure { + private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class); + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +ProcedureParameter.optional("distinct_count_view", DataTypes.StringType), +ProcedureParameter.optional("columns", STRING_ARRAY), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + public static SparkProcedures.ProcedureBuilder builder() { +return new Builder() { + @Override + protected DistinctCountProcedure doBuild() { +return new DistinctCountProcedure(tableCatalog()); + } +}; + } + + private DistinctCountProcedure(TableCatalog tableCatalog) { +super(tableCatalog); + } + + @Override + public ProcedureParameter[] parameters() { +return PARAMETERS; + } + + @Override + public StructType outputType() { +return OUTPUT_TYPE; + } + + @Override + public InternalRow[] call(InternalRow args) { +String tableName = args.getString(0); +Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name()); +SparkTable sparkTable = loadSparkTable(tableIdent); +StructType schema = sparkTable.schema(); +Table table = sparkTable.table(); +ArrayData columns = args.getArray(2); +int columnSizes = columns.numElements(); + +long[] ndvs = new long[columnSizes]; +int[] fieldId = new int[columnSizes]; +String query = "SELECT "; +for (int i = 0; i < columnSizes; i++) { + String colName = columns.getUTF8String(i).toString(); + query += "APPROX_COUNT_DISTINCT(" + colName + "), "; + fieldId[i] = schema.fieldIndex(colName); +} + +query = query.substring(0, query.length() - 2) + " FROM " + tableName; +Dataset df = spark().sql(query); + +for (int i = 0; i < columnSizes; i++) { + ndvs[i] = df.head().getLong(i); +} + +TableOperations operations =
[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV
huaxingao commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1070182092 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/DistinctCountProcedure.java: ## @@ -0,0 +1,188 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.ByteBuffer; +import java.util.UUID; +import org.apache.iceberg.GenericBlobMetadata; +import org.apache.iceberg.GenericStatisticsFile; +import org.apache.iceberg.HasTableOperations; +import org.apache.iceberg.Table; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.puffin.Blob; +import org.apache.iceberg.puffin.Puffin; +import org.apache.iceberg.puffin.PuffinWriter; +import org.apache.iceberg.puffin.StandardBlobTypes; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.spark.source.SparkTable; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * A procedure that gets approximate NDV (number of distinct value) for the requested columns and + * sets this to the table's StatisticsFile. 
+ */ +public class DistinctCountProcedure extends BaseProcedure { + private static final Logger LOG = LoggerFactory.getLogger(DistinctCountProcedure.class); + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +ProcedureParameter.optional("distinct_count_view", DataTypes.StringType), +ProcedureParameter.optional("columns", STRING_ARRAY), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + public static SparkProcedures.ProcedureBuilder builder() { +return new Builder() { + @Override + protected DistinctCountProcedure doBuild() { +return new DistinctCountProcedure(tableCatalog()); + } +}; + } + + private DistinctCountProcedure(TableCatalog tableCatalog) { +super(tableCatalog); + } + + @Override + public ProcedureParameter[] parameters() { +return PARAMETERS; + } + + @Override + public StructType outputType() { +return OUTPUT_TYPE; + } + + @Override + public InternalRow[] call(InternalRow args) { +String tableName = args.getString(0); +Identifier tableIdent = toIdentifier(tableName, PARAMETERS[0].name()); +SparkTable sparkTable = loadSparkTable(tableIdent); +StructType schema = sparkTable.schema(); +Table table = sparkTable.table(); +ArrayData columns = args.getArray(2); +int columnSizes = columns.numElements(); + +long[] ndvs = new long[columnSizes]; +int[] fieldId = new int[columnSizes]; +String query = "SELECT "; +for (int i = 0; i < columnSizes; i++) { + String colName = columns.getUTF8String(i).toString(); + query += "APPROX_COUNT_DISTINCT(" + colName + "), "; + fieldId[i] = schema.fieldIndex(colName); +} + +query = query.substring(0, query.length() - 2) + " FROM " + tableName; +Dataset df = spark().sql(query); + +for (int i = 0; i < columnSizes; i++) { + ndvs[i] = df.head().getLong(i); +} + +TableOperations operations =
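[Editor's note] Based on the parameters declared in the diff above (`table`, `distinct_count_view`, `columns`), the procedure would presumably be invoked along these lines; the procedure name, catalog, and table are assumptions for illustration, not the confirmed registration name.

```java
// Hypothetical invocation of the proposed procedure via Spark SQL; the
// parameter names match the PARAMETERS array in the diff, everything else
// is illustrative.
spark.sql(
    "CALL my_catalog.system.distinct_count("
        + "table => 'db.events', "
        + "columns => array('user_id', 'country'))");
```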
[GitHub] [iceberg] jackye1995 commented on pull request #6565: Core: View history entry core implementation
jackye1995 commented on PR #6565: URL: https://github.com/apache/iceberg/pull/6565#issuecomment-1382622527 Since the Immutables question is addressed and we have enough approvals, I will go ahead and merge the PR. Thanks for the review, @rdblue and @nastra! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 merged pull request #6565: Core: View history entry core implementation
jackye1995 merged PR #6565: URL: https://github.com/apache/iceberg/pull/6565 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 commented on pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
jackye1995 commented on PR #6428: URL: https://github.com/apache/iceberg/pull/6428#issuecomment-1382624709 Looks like some CI tests are failing; could you check? It may just need a rebase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs
jackye1995 commented on code in PR #6573: URL: https://github.com/apache/iceberg/pull/6573#discussion_r1070189868 ## docs/spark-queries.md: ## @@ -126,6 +126,8 @@ To select a specific table snapshot or the snapshot at some time in the DataFram * `snapshot-id` selects a specific table snapshot * `as-of-timestamp` selects the current snapshot at a timestamp, in milliseconds +* `branch` selects the head snapshot of the specified branch. Note that currently branch cannot be combined with as-of-timestamp. +* `tag` selects the snapshot associated with the specified tag Review Comment: Given that it is a syntax change, I am giving others more time to take a look. I think we can merge this one first and add that later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
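[Editor's note] The two new read options documented in the diff map directly onto DataFrameReader reads; a minimal sketch, assuming a table `db.table` with a branch named `audit` and a tag named `v1` already created.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the head snapshot of a branch.
Dataset<Row> branchDf =
    spark.read().format("iceberg").option("branch", "audit").load("db.table");

// Read the snapshot a tag points at.
Dataset<Row> tagDf =
    spark.read().format("iceberg").option("tag", "v1").load("db.table");
```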
[GitHub] [iceberg] jackye1995 commented on pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs
jackye1995 commented on PR #6573: URL: https://github.com/apache/iceberg/pull/6573#issuecomment-1382625570 Thanks everyone for the review. As I said in the thread, for the SQL-related changes I will wait some more in case there are disagreements. I will merge this in first, and we can add follow-up PRs on that front. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 merged pull request #6573: Docs: Add information on how to read from branches and tags in Spark docs
jackye1995 merged PR #6573: URL: https://github.com/apache/iceberg/pull/6573 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] renshangtao commented on pull request #6577: Fix the problem of shadow jar without modifying meta-inf/services at the same time
renshangtao commented on PR #6577: URL: https://github.com/apache/iceberg/pull/6577#issuecomment-1382632267 @nastra thank you for your reply. After this modification, the code executes as expected. I opened the runtime jar, and the class listed for the shadow jar in META-INF/services has changed to org.apache.iceberg.shaded.org.apache.orc.inpl.maskMaskProvider, which should be correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 commented on pull request #6575: Spark 3.3: support version travel by reference name
jackye1995 commented on PR #6575: URL: https://github.com/apache/iceberg/pull/6575#issuecomment-1382632711 @aokolnychyi @RussellSpitzer @rdblue any opinions about this support? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 opened a new pull request, #6586: AWS: make warehouse path optional for read only catalog use cases
jackye1995 opened a new pull request, #6586: URL: https://github.com/apache/iceberg/pull/6586 Currently the warehouse path must be specified no matter the situation, but in many cases the user just wants to initialize the Glue catalog to read data and does not want to pass in a warehouse path. This change supports that use case without the user having to specify a dummy warehouse path to bypass the checks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
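[Editor's note] A sketch of the read-only use case this would enable, assuming the change lands as described (no `warehouse` property required at initialization); the database and table names are made up.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.aws.glue.GlueCatalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

// Initialize the Glue catalog with no warehouse path and only read tables;
// today this initialization fails the warehouse-path precondition check.
GlueCatalog catalog = new GlueCatalog();
catalog.initialize("glue", ImmutableMap.of());
Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
```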
[GitHub] [iceberg] dennishuo commented on pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
dennishuo commented on PR #6428: URL: https://github.com/apache/iceberg/pull/6428#issuecomment-1382648326 @jackye1995 Thanks for the heads up! Looks like merging to head fixed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables
danielcweeks merged PR #6428: URL: https://github.com/apache/iceberg/pull/6428 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] dmgcodevil opened a new issue, #6587: Wrong class, java.lang.Long, for object: 19367
dmgcodevil opened a new issue, #6587: URL: https://github.com/apache/iceberg/issues/6587 ### Apache Iceberg version None ### Query engine None ### Please describe the bug 🐞 I have a timestamp field of type: `timestamptz`. I'm trying to compact files using the Spark action but am getting the following error: ``` Wrong class, java.lang.Long, for object: 19367 ``` My schema: ``` { "id" : 10, "name" : "tick_timestamp", "required" : false, "type" : "timestamptz" }, { "id" : 11, "name" : "transaction_timestamp", "required" : false, "type" : "timestamptz" } ``` Partition specs: ```json "partition-spec" : [ { "name" : "transaction_timestamp_day", "transform" : "void", "source-id" : 11, "field-id" : 1000 }, { "name" : "id059_bucket", "transform" : "bucket[20]", "source-id" : 5, "field-id" : 1001 }, { "name" : "tick_timestamp_day", "transform" : "day", "source-id" : 10, "field-id" : 1002 } ], "default-spec-id" : 2, "partition-specs" : [ { "spec-id" : 0, "fields" : [ { "name" : "transaction_timestamp_day", "transform" : "day", "source-id" : 11, "field-id" : 1000 }, { "name" : "id059_bucket", "transform" : "bucket[20]", "source-id" : 5, "field-id" : 1001 } ] }, { "spec-id" : 1, "fields" : [ { "name" : "tick_timestamp_day", "transform" : "day", "source-id" : 11, "field-id" : 1000 }, { "name" : "id059_bucket", "transform" : "bucket[20]", "source-id" : 5, "field-id" : 1001 } ] }, { "spec-id" : 2, "fields" : [ { "name" : "transaction_timestamp_day", "transform" : "void", "source-id" : 11, "field-id" : 1000 }, { "name" : "id059_bucket", "transform" : "bucket[20]", "source-id" : 5, "field-id" : 1001 }, { "name" : "tick_timestamp_day", "transform" : "day", "source-id" : 10, "field-id" : 1002 } ] } ] ``` Test: ```java val task = table.newScan().filter().planFiles().asScala.head Comparators.forType(table.spec.partitionType()).compare(task.file().partition(), task.file().partition()) ``` Error: ``` Wrong class, java.lang.Long, for object: 19367 ``` The one thing I noticed is that `` returns the following type: ``` {"type":"record","name":"PartitionData","namespace":"org.apache.iceberg","fields":[{"name":"tick_timestamp_day","type":["null",{"type":"int","logicalType":"date"}],"default":null,"field-id":1000},{"name":"id059_bucket","type":["null","int"],"default":null,"field-id":1001}]} ``` `tick_timestamp_day` has type `date` (int) instead of `timestamptz`. Any thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
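[Editor's note] One plausible reading of the error above, offered as a hedged illustration rather than a diagnosis: the `day` partition transform produces an `int` date value (days since the Unix epoch), so the partition struct's comparator is typed to `Integer` for that field, and a value materialized as a `Long` fails the class check. The value 19367 is consistent with this reading.

```java
// Illustrative only: day() yields int days from the Unix epoch, not a long
// timestamp, so 19367 decodes to a date close to when the issue was filed.
int days = 19367; // what the partition data holds for tick_timestamp_day
java.time.LocalDate date = java.time.LocalDate.ofEpochDay(days);
System.out.println(date); // prints 2023-01-10
```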
[GitHub] [iceberg] singhpk234 commented on pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure
singhpk234 commented on PR #6576: URL: https://github.com/apache/iceberg/pull/6576#issuecomment-1382673430 Did some more digging; posting what I found so far. This issue is only observed in 0.14.x and 1.0.0 (and I had tested my fix directly on top of master :sweat_smile:); 1.1.0 and master are fine. This is because starting with 1.1.0 we create a Glue table first (a dummy table, removed later if the commit does not succeed) before writing the S3 metadata files, and the write path is where the isTableRegisteredWithLF check runs; since the table is already created by then, the check passes. Prior to this, table creation used to happen after the S3 write, which would result in the failure. Given that behaviour, this PR should no longer be required, at least on master (happy to create backport PRs for older versions if required), hence closing this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] singhpk234 closed pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure
singhpk234 closed pull request #6576: AWS: Fix check for isTableRegisteredWithLF leading to CREATE table failure URL: https://github.com/apache/iceberg/pull/6576 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] singhpk234 commented on issue #6523: Table creation fails with Glue catalog on EMR
singhpk234 commented on issue #6523: URL: https://github.com/apache/iceberg/issues/6523#issuecomment-1382673665 Please consider using the Iceberg 1.1.0 release instead; it includes a fix for this failure, added as part of: - https://github.com/apache/iceberg/pull/4423/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file
ajantha-bhat commented on PR #6461: URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1382677828 > I'm still a little worried about saving this information, what does knowing the sort order mean for a file. Are we guaranteeing that the file is locally sorted by that order? or globally sorted with files added in that snapshot. I'm cautious about doing anything here until we really define what it means. **Spec says this. I am just trying to make the implementation the same as the spec. I am ok with changing the spec or implementation.** > `140 sort_order_id` - ID representing sort order for this file [3]. If the sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](https://iceberg.apache.org/spec/#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. cc: @rdblue -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name
nastra commented on code in PR #6575: URL: https://github.com/apache/iceberg/pull/6575#discussion_r1070237113 ## spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java: ## @@ -231,6 +234,67 @@ public void testVersionAsOf() { assertEquals("Snapshot at specific ID " + snapshotId, expected, fromDF); } + @Test + public void testTagReferenceAsOf() { +Table table = validationCatalog.loadTable(tableIdent); +long snapshotId = table.currentSnapshot().snapshotId(); +table.manageSnapshots().createTag("test_tag", snapshotId).commit(); + +// create a second snapshot, read the table at the snapshot +List expected = sql("SELECT * FROM %s", tableName); +sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName); +List actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_tag'", tableName); +assertEquals("Snapshot at specific tag reference name", expected, actual1); + +// read the table at the snapshot +// HIVE time travel syntax +List actual2 = sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_tag'", tableName); +assertEquals("Snapshot at specific tag reference name", expected, actual2); + +// read the table using DataFrameReader option: branch +Dataset df = +spark.read().format("iceberg").option(SparkReadOptions.TAG, "test_tag").load(tableName); +List fromDF = rowsToJava(df.collectAsList()); +assertEquals("Snapshot at specific tag reference name", expected, fromDF); + } + + @Test + public void testBranchReferenceAsOf() { +Table table = validationCatalog.loadTable(tableIdent); +long snapshotId = table.currentSnapshot().snapshotId(); +table.manageSnapshots().createBranch("test_branch", snapshotId).commit(); + +// create a second snapshot, read the table at the snapshot +List expected = sql("SELECT * FROM %s", tableName); +sql("INSERT INTO %s VALUES (4, 'd', 4.0), (5, 'e', 5.0)", tableName); +List actual1 = sql("SELECT * FROM %s VERSION AS OF 'test_branch'", tableName); +assertEquals("Snapshot at specific branch reference name", expected, actual1); + +// read the table at the snapshot +// HIVE time travel syntax +List actual2 = +sql("SELECT * FROM %s FOR SYSTEM_VERSION AS OF 'test_branch'", tableName); +assertEquals("Snapshot at specific branch reference name", expected, actual2); + +// read the table using DataFrameReader option: branch +Dataset df = +spark +.read() +.format("iceberg") +.option(SparkReadOptions.BRANCH, "test_branch") +.load(tableName); +List fromDF = rowsToJava(df.collectAsList()); +assertEquals("Snapshot at specific branch reference name", expected, fromDF); + } + + @Test + public void testUnknownReferenceAsOf() { +Assertions.assertThatThrownBy( +() -> sql("SELECT * FROM %s VERSION AS OF 'test_unknown'", tableName)) +.as("Cannot find matching snapshot ID or reference name for version") Review Comment: I think you rather want to have `hasMessage(..)` or `hasMessageContaining(...)` here instead of `as(..)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
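[Editor's note] For reference, a sketch of the suggested change: `as(...)` only attaches a description to the assertion, while `hasMessageContaining(...)` actually asserts on the thrown message. The expected message string is taken from the test diff above; whether to also pin the exception type with `isInstanceOf(...)` is left open here.

```java
// Asserting on the message itself, rather than only describing the assertion.
Assertions.assertThatThrownBy(
        () -> sql("SELECT * FROM %s VERSION AS OF 'test_unknown'", tableName))
    .hasMessageContaining("Cannot find matching snapshot ID or reference name for version");
```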
[GitHub] [iceberg] nastra commented on pull request #6577: Fix the problem of shadow jar without modifying meta-inf/services at the same time
nastra commented on PR #6577: URL: https://github.com/apache/iceberg/pull/6577#issuecomment-1382694306 > @nastra thank you for your reply. After this modification, the code executes as expected. I opened the runtime jar, and the class listed for the shadow jar in META-INF/services has changed to org.apache.iceberg.shaded.org.apache.orc.inpl.maskMaskProvider, which should be correct. Oh, you're right: it shows up under the relocated name. All good then, and still +1 on merging this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on pull request #6525: Python: Refactor loading manifests
Fokko commented on PR #6525: URL: https://github.com/apache/iceberg/pull/6525#issuecomment-1382722862 Thanks for the thorough review and PR @rdblue! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko merged pull request #6525: Python: Refactor loading manifests
Fokko merged PR #6525: URL: https://github.com/apache/iceberg/pull/6525 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko merged pull request #6585: Update ORC to 1.8.2
Fokko merged PR #6585: URL: https://github.com/apache/iceberg/pull/6585 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #6574: Python: Raise exception on deletes
Fokko commented on code in PR #6574: URL: https://github.com/apache/iceberg/pull/6574#discussion_r1070259471 ## python/pyiceberg/table/__init__.py: ## @@ -341,7 +346,18 @@ def plan_files(self) -> Iterator[FileScanTask]: all_files = files(io.new_input(manifest.manifest_path)) matching_partition_files = filter(partition_filter, all_files) -yield from (FileScanTask(file) for file in matching_partition_files) +def _check_content(file: ManifestFile) -> ManifestFile: +try: +if file.content == ManifestContent.DELETES: +raise ValueError("PyIceberg does not support deletes: https://github.com/apache/iceberg/issues/6568") Review Comment: Good call. I ran some table scans locally, but it's a great idea to have this unit-tested as well 👍🏻 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer opened a new pull request, #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes
RussellSpitzer opened a new pull request, #6588: URL: https://github.com/apache/iceberg/pull/6588 An issue we've run into frequently is that several Spark actions perform deletes on the driver with a default parallelism of 1. This is quite slow for S3 and painfully slow for very large tables. To fix this we change the default behavior to always be multithreaded deletes. The default for all Spark related actions can then be changed with a SQL Conf parameter as well as within each command with their own parallelism parameters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
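[Editor's note] A minimal sketch of the bounded-parallelism pattern the PR describes, using the `Tasks` utility that already appears in these actions' diffs; the pool size of 25 matches the proposed default, while `LOG`, `io`, and `pathsToDelete` are assumed to be in scope (a logger, a `FileIO`, and the paths to remove).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.iceberg.util.Tasks;

// Delete files on the driver with a fixed-size pool instead of serially.
ExecutorService deleteService = Executors.newFixedThreadPool(25);
try {
  Tasks.foreach(pathsToDelete)
      .executeWith(deleteService)
      .suppressFailureWhenFinished()
      .onFailure((path, exc) -> LOG.warn("Delete failed for {}", path, exc))
      .run(io::deleteFile);
} finally {
  deleteService.shutdown();
}
```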
[GitHub] [iceberg] RussellSpitzer commented on pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes
RussellSpitzer commented on PR #6588: URL: https://github.com/apache/iceberg/pull/6588#issuecomment-1382730394 @anuragmantri + @aokolnychyi + @rdblue - This is a bit of a big default behavior change but it's been biting a lot of our users lately and the change is relatively safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes
RussellSpitzer commented on code in PR #6588: URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261827 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java: ## @@ -47,4 +47,8 @@ private SparkSQLProperties() {} public static final String PRESERVE_DATA_GROUPING = "spark.sql.iceberg.planning.preserve-data-grouping"; public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false; + + // Controls how many physical file deletes to execute in parallel when not otherwise specified + public static final String DELETE_PARALLELISM = "driver-delete-default-parallelism"; + public static final String DELETE_PARALLELISM_DEFAULT = "25"; Review Comment: With S3's request throttling at around 4k requests per second, this leaves us a lot of headroom. Assuming a 50 ms response time, each thread issues about 20 requests per second, so 4000 requests per second / 20 requests per thread per second gives roughly 200 concurrent threads before throttling kicks in. Another option here is to also use the bulk-delete APIs, but that would only help S3-based filesystems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes
RussellSpitzer commented on code in PR #6588: URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261893 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java: ## @@ -231,24 +258,27 @@ protected DeleteSummary deleteFiles( DeleteSummary summary = new DeleteSummary(); -Tasks.foreach(files) -.retry(DELETE_NUM_RETRIES) -.stopRetryOn(NotFoundException.class) -.suppressFailureWhenFinished() -.executeWith(executorService) -.onFailure( -(fileInfo, exc) -> { - String path = fileInfo.getPath(); - String type = fileInfo.getType(); - LOG.warn("Delete failed for {}: {}", type, path, exc); -}) -.run( -fileInfo -> { - String path = fileInfo.getPath(); - String type = fileInfo.getType(); - deleteFunc.accept(path); - summary.deletedFile(path, type); -}); + +withDefaultDeleteService(executorService, (deleteService) -> { Review Comment: This covers ExpireSnapshots and DropTable with Purge -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes
RussellSpitzer commented on code in PR #6588: URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261893 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java: ## @@ -231,24 +258,27 @@ protected DeleteSummary deleteFiles( DeleteSummary summary = new DeleteSummary(); -Tasks.foreach(files) -.retry(DELETE_NUM_RETRIES) -.stopRetryOn(NotFoundException.class) -.suppressFailureWhenFinished() -.executeWith(executorService) -.onFailure( -(fileInfo, exc) -> { - String path = fileInfo.getPath(); - String type = fileInfo.getType(); - LOG.warn("Delete failed for {}: {}", type, path, exc); -}) -.run( -fileInfo -> { - String path = fileInfo.getPath(); - String type = fileInfo.getType(); - deleteFunc.accept(path); - summary.deletedFile(path, type); -}); + +withDefaultDeleteService(executorService, (deleteService) -> { Review Comment: This covers ExpireSnapshots and DropTable with Purge (via DeleteReachableFilesAction) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes
RussellSpitzer commented on code in PR #6588: URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261927 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ## @@ -246,12 +246,13 @@ private DeleteOrphanFiles.Result doExecute() { List orphanFiles = findOrphanFiles(spark(), actualFileIdentDS, validFileIdentDS, prefixMismatchMode); -Tasks.foreach(orphanFiles) -.noRetry() -.executeWith(deleteExecutorService) -.suppressFailureWhenFinished() -.onFailure((file, exc) -> LOG.warn("Failed to delete file: {}", file, exc)) -.run(deleteFunc::accept); +withDefaultDeleteService(deleteExecutorService, deleteService -> Review Comment: This change covers RemoveOrphanFiles -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6575: Spark 3.3: support version travel by reference name
RussellSpitzer commented on code in PR #6575: URL: https://github.com/apache/iceberg/pull/6575#discussion_r1070263376 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java: ## @@ -159,7 +160,15 @@ public Table loadTable(Identifier ident, String version) throws NoSuchTableExcep sparkTable.snapshotId() == null, "Cannot do time-travel based on both table identifier and AS OF"); - return sparkTable.copyWithSnapshotId(Long.parseLong(version)); + try { Review Comment: I'm not sure this can come up, but do we allow tags whose names look like snapshot IDs? For example, can I tag snapshot 2 with the name "1"? It's a weird edge case, so I don't think we really need to handle it; I'm just wondering whether it is a potential issue for the lookup code here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
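[Editor's note] To make the edge case above concrete, a hedged sketch of the resolution order the diff implies (parse the version string as a snapshot ID first, fall back to a reference name); under this order a tag literally named "123" would be shadowed by the numeric parse. Method and variable names are illustrative, not the PR's actual code.

```java
import org.apache.iceberg.SnapshotRef;
import org.apache.iceberg.Table;

// Hypothetical resolution order: numeric versions are treated as snapshot
// IDs; anything else is looked up as a branch or tag name.
static long resolveVersion(Table table, String version) {
  try {
    return Long.parseLong(version); // a tag named "123" is shadowed here
  } catch (NumberFormatException e) {
    SnapshotRef ref = table.refs().get(version);
    if (ref == null) {
      throw new IllegalArgumentException(
          "Cannot find matching snapshot ID or reference name for version: " + version);
    }
    return ref.snapshotId();
  }
}
```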
[GitHub] [iceberg] RussellSpitzer commented on issue #6587: Wrong class, java.lang.Long, for object: 19367
RussellSpitzer commented on issue #6587: URL: https://github.com/apache/iceberg/issues/6587#issuecomment-1382733719 Could you post the full trace from the Spark code? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org