Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-11-06 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1382910267


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/AvroUtil.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.PartitionData;
+import org.apache.iceberg.avro.AvroEncoderUtil;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.data.avro.DecoderResolver;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+
+/** Class for Avro-related utility methods. */
+class AvroUtil {
+  static final Map<Integer, String> FIELD_ID_TO_CLASS =
+      ImmutableMap.of(
+          DataComplete.ASSIGNMENTS_ELEMENT,
+          TopicPartitionOffset.class.getName(),
+          DataFile.PARTITION_ID,
+          PartitionData.class.getName(),
+          DataWritten.TABLE_REFERENCE,
+          TableReference.class.getName(),
+          DataWritten.DATA_FILES_ELEMENT,
+          "org.apache.iceberg.GenericDataFile",
+          DataWritten.DELETE_FILES_ELEMENT,
+          "org.apache.iceberg.GenericDeleteFile",
+          CommitToTable.TABLE_REFERENCE,
+          TableReference.class.getName());
+
+  public static byte[] encode(Event event) {
+    try {
+      return AvroEncoderUtil.encode(event, event.getSchema());
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  public static Event decode(byte[] bytes) {
+    try {
+      Event event = AvroEncoderUtil.decode(bytes);
+      // clear the cache to avoid memory leak
+      DecoderResolver.clearCache();
+      return event;
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  static Schema convert(Types.StructType icebergSchema, Class<? extends IndexedRecord> javaClass) {
+    return convert(icebergSchema, javaClass, FIELD_ID_TO_CLASS);
+  }
+
+  static Schema convert(
+      Types.StructType icebergSchema,
+      Class<? extends IndexedRecord> javaClass,
+      Map<Integer, String> typeMap) {
+    return AvroSchemaUtil.convert(
+        icebergSchema,
+        (fieldId, struct) ->
+            struct.equals(icebergSchema) ? javaClass.getName() : typeMap.get(fieldId));
+  }
+
+  static int positionToId(int position, Schema avroSchema) {
+    List<Schema.Field> fields = avroSchema.getFields();
+    Preconditions.checkArgument(position < fields.size(), "Invalid field position: " + position);

Review Comment:
   nit: should this also check that position isn't negative?
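
For reference, a sketch of what the suggested guard could look like (an
editorial illustration, not the committed code; only the quoted portion of the
method is shown):

```java
static int positionToId(int position, Schema avroSchema) {
  List<Schema.Field> fields = avroSchema.getFields();
  // reject negative positions as well as positions past the end
  Preconditions.checkArgument(
      position >= 0 && position < fields.size(), "Invalid field position: %s", position);
  // ... remainder of the method as in the PR
}
```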






Re: [PR] Clarify time travel implementation in Iceberg [iceberg]

2023-11-06 Thread via GitHub


nastra commented on code in PR #8982:
URL: https://github.com/apache/iceberg/pull/8982#discussion_r1382914816


##
format/spec.md:
##
@@ -1370,3 +1370,16 @@ Writing v2 metadata:
 * `sort_columns` was removed
 
 Note that these requirements apply when writing data to a v2 table. Tables 
that are upgraded from v1 may contain metadata that does not follow these 
requirements. Implementations should remain backward-compatible with v1 
metadata requirements.
+
+## Appendix F: Implementation Notes
+
+This section covers topics not required by the specification but 
recommendations for systems implementing the Iceberg specification
+to help maintain a uniform experience.
+
+### Point in Time Reads (Time Travel)
+
+Iceberg supports two types of histories for tables. A history of previous 
"current snapshots" stored ["snapshot-log" table 
metadata](#table-metadata-fields) and [parent-child lineage stored in 
"snapshots"](#table-metadata-fields). These two histories 
+might not be consistent for determining a table states at a given point in 
time due to a variety of table operations (e.g. branch-merge table workflows OR 
forcing the current state of table to specific snapshot ID).
+
+When processing point in time queries the Iceberg community has chosen to use 
"snapshot-log" metadata lookup the table state

Review Comment:
   ```suggestion
   When processing point in time queries the Iceberg community has chosen to 
use "snapshot-log" metadata to lookup the table state
   ```
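
For illustration, the lookup this appendix recommends amounts to scanning the
"snapshot-log" for the last entry at or before the requested time. A minimal
sketch, using a simplified stand-in for the log entry rather than any
particular implementation's class:

```python
from typing import List, Optional

class SnapshotLogEntry:
    """Stand-in for one "snapshot-log" entry from table metadata."""
    def __init__(self, snapshot_id: int, timestamp_ms: int) -> None:
        self.snapshot_id = snapshot_id
        self.timestamp_ms = timestamp_ms

def snapshot_as_of(snapshot_log: List[SnapshotLogEntry], timestamp_ms: int) -> Optional[int]:
    """Return the id of the snapshot that was current at the given time, if any."""
    result = None
    for entry in snapshot_log:  # entries are appended in time order
        if entry.timestamp_ms <= timestamp_ms:
            result = entry.snapshot_id
    return result
```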






Re: [PR] Clarify time travel implementation in Iceberg [iceberg]

2023-11-06 Thread via GitHub


nastra commented on code in PR #8982:
URL: https://github.com/apache/iceberg/pull/8982#discussion_r1382915561


##
format/spec.md:
##
@@ -1370,3 +1370,16 @@ Writing v2 metadata:
 * `sort_columns` was removed
 
 Note that these requirements apply when writing data to a v2 table. Tables 
that are upgraded from v1 may contain metadata that does not follow these 
requirements. Implementations should remain backward-compatible with v1 
metadata requirements.
+
+## Appendix F: Implementation Notes
+
+This section covers topics not required by the specification but 
recommendations for systems implementing the Iceberg specification
+to help maintain a uniform experience.
+
+### Point in Time Reads (Time Travel)
+
+Iceberg supports two types of histories for tables. A history of previous 
"current snapshots" stored ["snapshot-log" table 
metadata](#table-metadata-fields) and [parent-child lineage stored in 
"snapshots"](#table-metadata-fields). These two histories 

Review Comment:
   ```suggestion
   Iceberg supports two types of histories for tables. A history of previous 
"current snapshots" stored in ["snapshot-log" table 
metadata](#table-metadata-fields) and [parent-child lineage stored in 
"snapshots"](#table-metadata-fields). These two histories 
   ```






Re: [I] ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split s3 [iceberg]

2023-11-06 Thread via GitHub


pawankukreja01 commented on issue #8427:
URL: https://github.com/apache/iceberg/issues/8427#issuecomment-1794305805

   This error occurs only in Athena, while running a query on the table using 
Spark works fine. According to the [Amazon EMR 
documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html),
 Apache Iceberg is an open table format for large data sets in Amazon Simple 
Storage Service (Amazon S3). It provides fast query performance over large 
tables, atomic commits, concurrent writes, and SQL-compatible table evolution. 
Starting with Amazon EMR 6.5.0, you can use Apache Spark 3 on Amazon EMR 
clusters with the Iceberg table format.





Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]

2023-11-06 Thread via GitHub


gaborkaszab commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1382965493


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   What would be the meaning of a null in the list of equality IDs? If I'm not 
mistaken, equality IDs identify which columns are used for the equality checks, 
and a 'null column' doesn't make sense to me.
   
   If you mean null values in the equality delete files, that's a good question. 
I believe in the SQL world NULL doesn't equal any value (even NULL doesn't equal 
NULL), so I wonder what the desired semantics should be when we find a NULL 
value in an equality delete file. Should we apply "IS NULL" on that particular 
column? But then it wouldn't be an equality check, as the name "equality delete" 
suggests, but an IS NULL check. I'd simply not allow NULL values in the delete 
files TBH.
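
To make the two candidate semantics in this comment concrete, here is a
hypothetical matcher (an editorial sketch, not Iceberg's implementation)
contrasting an IS-NULL interpretation with strict SQL equality, under which a
NULL delete value matches nothing:

```python
from typing import Any, Dict, List

def matches_delete_row(
    row: Dict[str, Any], delete_row: Dict[str, Any],
    delete_cols: List[str], null_matches_null: bool,
) -> bool:
    """Return True if `row` is deleted by `delete_row` on the delete columns."""
    for col in delete_cols:
        if delete_row[col] is None:
            # IS NULL interpretation matches NULL data values;
            # strict SQL equality would never match.
            if not (null_matches_null and row[col] is None):
                return False
        elif row[col] != delete_row[col]:
            return False
    return True

assert matches_delete_row({"id": None}, {"id": None}, ["id"], null_matches_null=True)
assert not matches_delete_row({"id": None}, {"id": None}, ["id"], null_matches_null=False)
```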






[I] Remove properties from `JdbcUtil` [iceberg]

2023-11-06 Thread via GitHub


Fokko opened a new issue, #8989:
URL: https://github.com/apache/iceberg/issues/8989

   ### Feature Request / Improvement
   
   There are some duplicate properties: 
https://github.com/apache/iceberg/blob/b0bf62a448617bd5f57ca72c2648452e6600fa20/core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java#L44-L45
   
   We can just re-use the ones in `BaseMetastoreTableOperations`: 
https://github.com/apache/iceberg/blob/b0bf62a448617bd5f57ca72c2648452e6600fa20/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L58
   
   Since they are private in `JdbcUtil`, we can safely remove them.
   
   ### Query engine
   
   None





Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-06 Thread via GitHub


github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1794358478

   We enabled S3 versioning on the bucket and can see a file name being used 
twice by two distinct micro-batches, so it is not a case of task retry inside 
Spark.
   
   This issue leads to data loss, as the original file is replaced, and to 
metadata corruption, as the metadata still references a file that no longer 
exists.





Re: [I] Ability to the write Metadata JSON [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on issue #22:
URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1794369668

   @HonahX Yes, I think that is how it should be done (except the builder 
pattern, that's very much Java style :). I think we can split the work into 
several work packages:
   
   - Ability to write the JSON to the object store (that was the intent of this 
PR); see the sketch below.
   - Have logic to update the metadata dictionary as you pointed out above. I 
think we can do this per operation (update schema, update partition-spec, 
update sort-order, etc.) to keep it small so we can get it in quickly.
   - Implement the commit method per catalog to update the properties (point to 
the latest metadata, and [set the previous 
metadata](https://github.com/apache/iceberg/blob/b0bf62a448617bd5f57ca72c2648452e6600fa20/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L58)).
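
A minimal sketch of that first work package, assuming pyiceberg's
`FileIO`/`OutputFile` interfaces and a metadata document already serialized to
JSON; the helper name is illustrative:

```python
from pyiceberg.io import FileIO

def write_metadata_json(metadata_json: str, location: str, io: FileIO) -> None:
    """Write a table-metadata JSON document to the object store."""
    output_file = io.new_output(location)
    # create() refuses to overwrite by default, which suits metadata
    # files: once written they are immutable.
    with output_file.create() as out:
        out.write(metadata_json.encode("utf-8"))
```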





Re: [I] support partition spec update in pyiceberg [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on issue #124:
URL: https://github.com/apache/iceberg-python/issues/124#issuecomment-1794371184

   Thanks for raising this! @puchengy which catalog are you using? 
   
   Related is https://github.com/apache/iceberg-python/issues/22





Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]

2023-11-06 Thread via GitHub


lirui-apache commented on issue #5846:
URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1794403680

   Hi @chenwyi2 @pvary , thanks for the clarifications. Yeah we changed our 
snapshot expire routine to keep the last snapshot created by Flink. If this is 
a limitation by design, I guess it's better to mention this in the doc.





Re: [PR] Build: Bump mypy-boto3-glue from 1.28.63 to 1.28.77 [iceberg-python]

2023-11-06 Thread via GitHub


Fokko merged PR #130:
URL: https://github.com/apache/iceberg-python/pull/130





Re: [PR] Catch warning in PyLint tests [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #33:
URL: https://github.com/apache/iceberg-python/pull/33#discussion_r1383041054


##
pyiceberg/manifest.py:
##
@@ -783,8 +783,8 @@ def __init__(self, spec: PartitionSpec, schema: Schema, output_file: OutputFile,
             output_file,
             snapshot_id,
             {
-                "schema": schema.json(),
-                "partition-spec": spec.json(),
+                "schema": schema.model_dump_json(),

Review Comment:
   `.json()` is deprecated and will be replaced with `.model_dump_json()`
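
For context, this is the Pydantic v1 to v2 rename; a tiny standalone example
with a stand-in model (not pyiceberg's):

```python
from pydantic import BaseModel

class Spec(BaseModel):  # stand-in model for illustration
    spec_id: int = 0

spec = Spec()
# Pydantic v1 style, deprecated in v2: spec.json()
# Pydantic v2 replacement, same JSON output:
print(spec.model_dump_json())  # {"spec_id":0}
```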






Re: [PR] Catch warning in PyLint tests [iceberg-python]

2023-11-06 Thread via GitHub


Fokko merged PR #33:
URL: https://github.com/apache/iceberg-python/pull/33





Re: [PR] Parquet: don't throw exception on row group filters when reading INT96 column [iceberg]

2023-11-06 Thread via GitHub


nastra commented on code in PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#discussion_r1383166362


##
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java:
##
@@ -199,7 +201,7 @@ public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
       int id = ref.fieldId();
 
       Boolean hasNonDictPage = isFallback.get(id);
-      if (hasNonDictPage == null || hasNonDictPage) {
+      if (hasNonDictPage == null || hasNonDictPage || isInt96Column(id)) {

Review Comment:
   I'm a bit sceptical about having this check here. Can you elaborate why this 
is the right place to perform the check?






Re: [PR] Parquet: don't throw exception on row group filters when reading INT96 column [iceberg]

2023-11-06 Thread via GitHub


nastra commented on code in PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#discussion_r1383168479


##
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java:
##
@@ -199,7 +201,7 @@ public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
       int id = ref.fieldId();
 
       Boolean hasNonDictPage = isFallback.get(id);
-      if (hasNonDictPage == null || hasNonDictPage) {
+      if (hasNonDictPage == null || hasNonDictPage || isInt96Column(id)) {

Review Comment:
   I would have expected something like this:
   ```
   --- a/parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
   +++ b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
   @@ -19,6 +19,7 @@
    package org.apache.iceberg.parquet;
    
    import java.io.IOException;
   +import java.nio.ByteOrder;
    import java.util.Comparator;
    import java.util.Map;
    import java.util.Set;
   @@ -447,6 +448,13 @@ public class ParquetDictionaryRowGroupFilter {
          case INT64:
            dictSet.add((T) conversion.apply(dict.decodeToLong(i)));
            break;
   +      case INT96:
   +        dictSet.add(
   +            (T)
   +                conversion.apply(
   +                    ParquetUtil.extractTimestampInt96(
   +                        dict.decodeToBinary(i).toByteBuffer().order(ByteOrder.LITTLE_ENDIAN))));
   +        break;
   ```






Re: [PR] Build: Bump arrow from 13.0.0 to 14.0.0 [iceberg]

2023-11-06 Thread via GitHub


nastra merged PR #8984:
URL: https://github.com/apache/iceberg/pull/8984





Re: [PR] Build: Bump software.amazon.awssdk:bom from 2.21.10 to 2.21.15 [iceberg]

2023-11-06 Thread via GitHub


nastra merged PR #8983:
URL: https://github.com/apache/iceberg/pull/8983





Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383228406


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+    added_size: int
+    removed_size: int
+    added_files: int
+    removed_files: int
+    added_eq_delete_files: int
+    removed_eq_delete_files: int
+    added_pos_delete_files: int
+    removed_pos_delete_files: int
+    added_delete_files: int
+    removed_delete_files: int
+    added_records: int
+    deleted_records: int
+    added_pos_deletes: int
+    removed_pos_deletes: int
+    added_eq_deletes: int
+    removed_eq_deletes: int
+
+    def __init__(self) -> None:
+        self.added_size = 0
+        self.removed_size = 0
+        self.added_files = 0
+        self.removed_files = 0
+        self.added_eq_delete_files = 0
+        self.removed_eq_delete_files = 0
+        self.added_pos_delete_files = 0
+        self.removed_pos_delete_files = 0
+        self.added_delete_files = 0
+        self.removed_delete_files = 0
+        self.added_records = 0
+        self.deleted_records = 0
+        self.added_pos_deletes = 0
+        self.removed_pos_deletes = 0
+        self.added_eq_deletes = 0
+        self.removed_eq_deletes = 0
+
+    def add_file(self, data_file: DataFile) -> None:
+        self.added_size += data_file.file_size_in_bytes
+
+        if data_file.content == DataFileContent.DATA:
+            self.added_files += 1
+            self.added_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.added_delete_files += 1
+            self.added_pos_delete_files += 1
+            self.added_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.added_delete_files += 1
+            self.added_eq_delete_files += 1
+            self.added_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def removed_file(self, data_file: DataFile) -> None:
+        self.removed_size += data_file.file_size_in_bytes
+
+        if data_file.content == DataFileContent.DATA:
+            self.removed_files += 1
+            self.deleted_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.removed_delete_files += 1
+            self.removed_pos_delete_files += 1
+            self.removed_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.removed_delete_files += 1
+            self.removed_eq_delete_files += 1
+            self.removed_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def added_manifest(self, manifest: ManifestFile) -> None:
+        if manifest.content == ManifestContent.DATA:
+            self.added_files += manifest.added_files_count or 0
+            self.added_records += manifest.added_rows_count or 0
+            self.removed_files += manifest.deleted_files_count or 0
+            self.deleted_records += manifest.deleted_rows_count or 0
+        elif manifest.content == ManifestContent.DELETES:
+            self.added_delete_files += manifest.added_files_count or 0
+            self.removed_delete_files += manifest.deleted_files_count or 0
+        else:
+            raise ValueError(f"Unknown manifest file content: {manifest.content}")
+
+    def build(self) -> Dict[str, str]:
+        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:
+            if num > 0:
+                properties[property_name] = str(num)
+
+        properties: Dict[str, str] = {}
+        set_non_zero(properties, self.added_size, 'added-files-size')
+        set_non_zero(properties, self.removed_size, 'removed-files-size')
+        set_non_zero(properties, self.added_files, 'added-data-files')
+        set_non_zero(properties, self.removed_files, 'removed-data-files')

Review Comment:
   Ahh, thanks!
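
A minimal usage sketch of the collector quoted above, assuming the complete
class from the PR (the quoted hunk is truncated before `build()` returns);
`FakeDataFile` is a stand-in exposing only the three fields the collector
reads, with content ids mirroring the spec:

```python
from dataclasses import dataclass
from enum import Enum

class DataFileContent(int, Enum):  # mirrors the spec's content ids
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2

@dataclass
class FakeDataFile:  # stand-in for pyiceberg's DataFile
    content: DataFileContent
    record_count: int
    file_size_in_bytes: int

collector = SnapshotSummaryCollector()  # as defined in the diff above
collector.add_file(FakeDataFile(DataFileContent.DATA, record_count=100, file_size_in_bytes=4096))
print(collector.build())
# {'added-files-size': '4096', 'added-data-files': '1', 'added-records': '100'}
```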






Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383228989


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+class SnapshotSummaryCollector:
+    ...

Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383232094


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+class SnapshotSummaryCollector:
+    ...

Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383244841


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+class SnapshotSummaryCollector:
+    ...

Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]

2023-11-06 Thread via GitHub


pvary commented on issue #5846:
URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1794748179

   @lirui-apache: Would you mind adding this to `docs/flink-writes.md`? I would 
be happy to review.
   Thanks,
   Peter





Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383290717


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+class SnapshotSummaryCollector:
+    ...
+    def add_file(self, data_file: DataFile) -> None:

Review Comment:
   Yes, this represents the `data_file` in the Manifest: 
https://iceberg.apache.org/spec/#manifests






Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383293089


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+class SnapshotSummaryCollector:
+    ...
+    def removed_file(self, data_file: DataFile) -> None:

Review Comment:
   Good one, I went for `add_file` and `remove_file`.






Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383296211


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+class SnapshotSummaryCollector:
+    ...
+    def added_manifest(self, manifest: ManifestFile) -> None:

Review Comment:
   When you append to a table, we'll append the existing manifests: 
https://github.com/apache/iceberg-python/pull/41/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bR1811-R1821
   
   Any concerns?






Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383296718


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
 snapshot_id: int = Field(alias="snapshot-id")
 timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+added_size: int
+removed_size: int
+added_files: int
+removed_files: int
+added_eq_delete_files: int
+removed_eq_delete_files: int
+added_pos_delete_files: int
+removed_pos_delete_files: int
+added_delete_files: int
+removed_delete_files: int
+added_records: int
+deleted_records: int
+added_pos_deletes: int
+removed_pos_deletes: int
+added_eq_deletes: int
+removed_eq_deletes: int
+
+def __init__(self) -> None:
+self.added_size = 0
+self.removed_size = 0
+self.added_files = 0
+self.removed_files = 0
+self.added_eq_delete_files = 0
+self.removed_eq_delete_files = 0
+self.added_pos_delete_files = 0
+self.removed_pos_delete_files = 0
+self.added_delete_files = 0
+self.removed_delete_files = 0
+self.added_records = 0
+self.deleted_records = 0
+self.added_pos_deletes = 0
+self.removed_pos_deletes = 0
+self.added_eq_deletes = 0
+self.removed_eq_deletes = 0
+
+def add_file(self, data_file: DataFile) -> None:
+self.added_size += data_file.file_size_in_bytes
+
+if data_file.content == DataFileContent.DATA:
+self.added_files += 1
+self.added_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.added_delete_files += 1
+self.added_pos_delete_files += 1
+self.added_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.added_delete_files += 1
+self.added_eq_delete_files += 1
+self.added_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def removed_file(self, data_file: DataFile) -> None:
+self.removed_size += data_file.file_size_in_bytes
+
+if data_file.content == DataFileContent.DATA:
+self.removed_files += 1
+self.deleted_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.removed_delete_files += 1
+self.removed_pos_delete_files += 1
+self.removed_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.removed_delete_files += 1
+self.removed_eq_delete_files += 1
+self.removed_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def added_manifest(self, manifest: ManifestFile) -> None:
+if manifest.content == ManifestContent.DATA:
+self.added_files += manifest.added_files_count or 0
+self.added_records += manifest.added_rows_count or 0
+self.removed_files += manifest.deleted_files_count or 0
+self.deleted_records += manifest.deleted_rows_count or 0
+elif manifest.content == ManifestContent.DELETES:
+self.added_delete_files += manifest.added_files_count or 0
+self.removed_delete_files += manifest.deleted_files_count or 0
+else:
+raise ValueError(f"Unknown manifest file content: 
{manifest.content}")
+
+def build(self) -> Dict[str, str]:
+def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:

Review Comment:
   Good one!
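   
   The same keep-it-sparse idea as a standalone Java sketch (names are 
illustrative): only non-zero counters are materialized in the summary map.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   class SummarySketch {
     // Add a counter to the summary only when it is non-zero, keeping the map sparse.
     static void setNonZero(Map<String, String> properties, long num, String propertyName) {
       if (num > 0) {
         properties.put(propertyName, String.valueOf(num));
       }
     }
   
     public static void main(String[] args) {
       Map<String, String> summary = new HashMap<>();
       setNonZero(summary, 3, "added-data-files");
       setNonZero(summary, 0, "removed-data-files"); // skipped: zero
       System.out.println(summary); // prints {added-data-files=3}
     }
   }
   ```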



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]

2023-11-06 Thread via GitHub


cccs-jc commented on code in PR #8980:
URL: https://github.com/apache/iceberg/pull/8980#discussion_r1383313970


##
core/src/main/java/org/apache/iceberg/MicroBatches.java:
##
@@ -92,7 +92,7 @@ private static List<Pair<ManifestFile, Integer>> indexManifests(
 
 for (ManifestFile manifest : manifestFiles) {
   manifestIndexes.add(Pair.of(manifest, currentFileIndex));
-  currentFileIndex += manifest.addedFilesCount() + manifest.existingFilesCount();
+  currentFileIndex += manifest.addedFilesCount();

Review Comment:
   I would, but I don't know how.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383319743


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
 snapshot_id: int = Field(alias="snapshot-id")
 timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+added_size: int
+removed_size: int
+added_files: int
+removed_files: int
+added_eq_delete_files: int
+removed_eq_delete_files: int
+added_pos_delete_files: int
+removed_pos_delete_files: int
+added_delete_files: int
+removed_delete_files: int
+added_records: int
+deleted_records: int
+added_pos_deletes: int
+removed_pos_deletes: int
+added_eq_deletes: int
+removed_eq_deletes: int
+
+def __init__(self) -> None:
+self.added_size = 0
+self.removed_size = 0
+self.added_files = 0
+self.removed_files = 0
+self.added_eq_delete_files = 0
+self.removed_eq_delete_files = 0
+self.added_pos_delete_files = 0
+self.removed_pos_delete_files = 0
+self.added_delete_files = 0
+self.removed_delete_files = 0
+self.added_records = 0
+self.deleted_records = 0
+self.added_pos_deletes = 0
+self.removed_pos_deletes = 0
+self.added_eq_deletes = 0
+self.removed_eq_deletes = 0
+
+def add_file(self, data_file: DataFile) -> None:
+self.added_size += data_file.file_size_in_bytes
+
+if data_file.content == DataFileContent.DATA:
+self.added_files += 1
+self.added_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.added_delete_files += 1
+self.added_pos_delete_files += 1
+self.added_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.added_delete_files += 1
+self.added_eq_delete_files += 1
+self.added_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def removed_file(self, data_file: DataFile) -> None:
+self.removed_size += data_file.file_size_in_bytes
+
+if data_file.content == DataFileContent.DATA:
+self.removed_files += 1
+self.deleted_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.removed_delete_files += 1
+self.removed_pos_delete_files += 1
+self.removed_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.removed_delete_files += 1
+self.removed_eq_delete_files += 1
+self.removed_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def added_manifest(self, manifest: ManifestFile) -> None:

Review Comment:
   I'm also okay with going with normal appends instead of fast-appends



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383329498


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,199 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
 snapshot_id: int = Field(alias="snapshot-id")
 timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+added_size: int
+removed_size: int
+added_files: int
+removed_files: int
+added_eq_delete_files: int
+removed_eq_delete_files: int
+added_pos_delete_files: int
+removed_pos_delete_files: int
+added_delete_files: int
+removed_delete_files: int
+added_records: int
+deleted_records: int
+added_pos_deletes: int
+removed_pos_deletes: int
+added_eq_deletes: int
+removed_eq_deletes: int
+
+def __init__(self) -> None:
+self.added_size = 0
+self.removed_size = 0
+self.added_files = 0
+self.removed_files = 0
+self.added_eq_delete_files = 0
+self.removed_eq_delete_files = 0
+self.added_pos_delete_files = 0
+self.removed_pos_delete_files = 0
+self.added_delete_files = 0
+self.removed_delete_files = 0
+self.added_records = 0
+self.deleted_records = 0
+self.added_pos_deletes = 0
+self.removed_pos_deletes = 0
+self.added_eq_deletes = 0
+self.removed_eq_deletes = 0
+
+def add_file(self, data_file: DataFile) -> None:
+if data_file.content == DataFileContent.DATA:
+self.added_files += 1
+self.added_records += data_file.record_count
+self.added_size += data_file.file_size_in_bytes
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.added_delete_files += 1
+self.added_pos_delete_files += 1
+self.added_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.added_delete_files += 1
+self.added_eq_delete_files += 1
+self.added_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def removed_file(self, data_file: DataFile) -> None:
+if data_file.content == DataFileContent.DATA:
+self.removed_files += 1
+self.deleted_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.removed_delete_files += 1
+self.removed_pos_delete_files += 1
+self.removed_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.removed_delete_files += 1
+self.removed_eq_delete_files += 1
+self.removed_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def added_manifest(self, manifest: ManifestFile) -> None:
+if manifest.content == ManifestContent.DATA:
+self.added_files += manifest.added_files_count or 0
+self.added_records += manifest.added_rows_count or 0
+self.removed_files += manifest.deleted_files_count or 0
+self.deleted_records += manifest.deleted_rows_count or 0
+elif manifest.content == ManifestContent.DELETES:
+self.added_delete_files += manifest.added_files_count or 0
+self.removed_delete_files += manifest.deleted_files_count or 0
+else:
+raise ValueError(f"Unknown manifest file content: 
{manifest.content}")
+
+def build(self) -> Dict[str, str]:
+def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:
+if num > 0:
+properties[property_name] = str(num)
+
+properties: Dict[str, str] = {}
+set_non_zero(properties, self.added_size, 'added-files-size')
+set_non_zero(properties, self.removed_size, 'removed-files-size')
+set_non_zero(properties, self.added_files, 'added-data-files')
+set_non_zero(properties, self.removed_files, 'removed-data-files')
+set_non_zero(properties, self.added_eq_delete_files, 'added-equality-delete-files')
+set_non_zero(properties, self.removed_eq_delete_files, 'removed-equality-delete-files')
+set_non_zero(properties, self.added_pos_delete_files, 'added-position-delete-files')
+set_non_zero(properties, self.removed_pos_delete_files, 'removed-position-delete-files')
+set_non_zero(properties, self.added_delete_files, 'added-delete-files')
+set_non_zero(properties, self.removed_delete_files, 'removed-delete-files')
+set_non_zero(properties, self.added_records, 'added-records')
+set_non_zero(properties, self.deleted_records, 'deleted-records')
+set_non_zero(pr

Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]

2023-11-06 Thread via GitHub


Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383319743


##
pyiceberg/table/snapshots.py:
##
@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
 snapshot_id: int = Field(alias="snapshot-id")
 timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+added_size: int
+removed_size: int
+added_files: int
+removed_files: int
+added_eq_delete_files: int
+removed_eq_delete_files: int
+added_pos_delete_files: int
+removed_pos_delete_files: int
+added_delete_files: int
+removed_delete_files: int
+added_records: int
+deleted_records: int
+added_pos_deletes: int
+removed_pos_deletes: int
+added_eq_deletes: int
+removed_eq_deletes: int
+
+def __init__(self) -> None:
+self.added_size = 0
+self.removed_size = 0
+self.added_files = 0
+self.removed_files = 0
+self.added_eq_delete_files = 0
+self.removed_eq_delete_files = 0
+self.added_pos_delete_files = 0
+self.removed_pos_delete_files = 0
+self.added_delete_files = 0
+self.removed_delete_files = 0
+self.added_records = 0
+self.deleted_records = 0
+self.added_pos_deletes = 0
+self.removed_pos_deletes = 0
+self.added_eq_deletes = 0
+self.removed_eq_deletes = 0
+
+def add_file(self, data_file: DataFile) -> None:
+self.added_size += data_file.file_size_in_bytes
+
+if data_file.content == DataFileContent.DATA:
+self.added_files += 1
+self.added_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.added_delete_files += 1
+self.added_pos_delete_files += 1
+self.added_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.added_delete_files += 1
+self.added_eq_delete_files += 1
+self.added_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def removed_file(self, data_file: DataFile) -> None:
+self.removed_size += data_file.file_size_in_bytes
+
+if data_file.content == DataFileContent.DATA:
+self.removed_files += 1
+self.deleted_records += data_file.record_count
+elif data_file.content == DataFileContent.POSITION_DELETES:
+self.removed_delete_files += 1
+self.removed_pos_delete_files += 1
+self.removed_pos_deletes += data_file.record_count
+elif data_file.content == DataFileContent.EQUALITY_DELETES:
+self.removed_delete_files += 1
+self.removed_eq_delete_files += 1
+self.removed_eq_deletes += data_file.record_count
+else:
+raise ValueError(f"Unknown data file content: {data_file.content}")
+
+def added_manifest(self, manifest: ManifestFile) -> None:

Review Comment:
   I'm also okay with going with normal appends instead of fast-appends. I've 
removed the method.
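   
   For readers mapping this to the Java API, both operations exist on `Table`; 
a short sketch (the `DataFile` construction is elided):
   
   ```java
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Table;
   
   class AppendSketch {
     // A fast append only adds a new manifest; a normal append may also merge
     // existing manifests according to the table's manifest-merge properties.
     static void appendBoth(Table table, DataFile file) {
       table.newFastAppend().appendFile(file).commit(); // no manifest rewrite
       table.newAppend().appendFile(file).commit();     // may compact manifests
     }
   }
   ```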



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383335036


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, 
TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+if (!isValidIdentifier(identifier)) {
+  return false;
+}
+try {
+  String database = identifier.namespace().level(0);
+  String viewName = identifier.name();
+  Table table = clients.run(client -> client.getTable(database, viewName));
+  HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, 
identifier));
+  clients.run(
+  client -> {
+client.dropTable(database, viewName);
+return null;
+  });
+  LOG.info("Dropped View: {}", identifier);
+  return true;
+
+} catch (NoSuchViewException | NoSuchObjectException e) {
+  LOG.info("Skipping drop, View does not exist: {}", identifier, e);
+  return false;
+} catch (TException e) {
+  throw new RuntimeException("Failed to drop " + identifier, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to dropView", e);
+}
+  }
+
+  @Override
+  public List<TableIdentifier> listViews(Namespace namespace) {
+try {
+  return listContents(namespace, TableType.VIRTUAL_VIEW.name(), 
icebergPredicate());
+} catch (UnknownDBException e) {
+  throw new NoSuchNamespaceException("Namespace does not exist: %s", 
namespace);
+
+} catch (TException e) {
+  throw new RuntimeException("Failed to list all views under namespace " + 
namespace, e);
+
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to listViews", e);
+}
+  }
+
+  private List<TableIdentifier> listContents(
+  Namespace namespace, String tableType, Predicate<Table> tablePredicate)
+  throws TException, InterruptedException {
+Preconditions.checkArgument(
+isValidateNamespace(namespace), "Missing database in namespace: %s", 
namespace);
+String database = namespace.level(0);
+List<String> tableNames =
+StringUtils.isNotEmpty(tableType)
+? clients.run(client -> client.getTables(database, "*", 
TableType.valueOf(tableType)))
+: clients.run(client -> client.getAllTables(database));
+List<Table> tableObjects =
+clients.run(client -> client.getTableObjectsByName(database, 
tableNames));

Review Comment:
   This HMS call was not used before the patch when all tables were listed. 
After the patch we fetch all of the tables, even when it is not necessary.
   
   I remember cases where there were too many tables in a database and 
fetching all of the tables was problematic, so I would like us to find a 
better solution here.
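   
   One possible direction, sketched under the assumption that bounding the 
request size is acceptable (`BATCH_SIZE` and `fetchInBatches` are 
illustrative, not existing Iceberg code): fetch only the names of the 
requested type first, then resolve table objects in bounded batches instead 
of one call for the whole database.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.hadoop.hive.metastore.IMetaStoreClient;
   import org.apache.hadoop.hive.metastore.api.Table;
   import org.apache.thrift.TException;
   
   class ListingSketch {
     private static final int BATCH_SIZE = 100; // illustrative bound
   
     // Resolve table objects in slices so a very large database never turns
     // into a single huge getTableObjectsByName request.
     static List<Table> fetchInBatches(IMetaStoreClient client, String db, List<String> names)
         throws TException {
       List<Table> result = new ArrayList<>();
       for (int i = 0; i < names.size(); i += BATCH_SIZE) {
         List<String> batch = names.subList(i, Math.min(i + BATCH_SIZE, names.size()));
         result.addAll(client.getTableObjectsByName(db, batch));
       }
       return result;
     }
   }
   ```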



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383336007


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -115,42 +126,13 @@ public void initialize(String inputName, Map<String, String> properties) {
 
   @Override
   public List listTables(Namespace namespace) {
-Preconditions.checkArgument(
-isValidateNamespace(namespace), "Missing database in namespace: %s", 
namespace);
-String database = namespace.level(0);
 
 try {
-  List<String> tableNames = clients.run(client -> client.getAllTables(database));
-  List<TableIdentifier> tableIdentifiers;
-
   if (listAllTables) {
-tableIdentifiers =
-tableNames.stream()
-.map(t -> TableIdentifier.of(namespace, t))
-.collect(Collectors.toList());
+return listContents(namespace, null, table -> true);
   } else {
-List<Table> tableObjects =
-clients.run(client -> client.getTableObjectsByName(database, tableNames));
-tableIdentifiers =
-tableObjects.stream()
-.filter(
-table ->
-table.getParameters() != null
-&& BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE
-.equalsIgnoreCase(
-table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)))
-.map(table -> TableIdentifier.of(namespace, table.getTableName()))
-.collect(Collectors.toList());
+return listContents(namespace, TableType.EXTERNAL_TABLE.name(), 
icebergPredicate());
   }
-
-  LOG.debug(
-  "Listing of namespace: {} resulted in the following tables: {}",
-  namespace,
-  tableIdentifiers);

Review Comment:
   Did we intentionally remove this log line?



##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -115,42 +126,13 @@ public void initialize(String inputName, Map<String, String> properties) {
 
   @Override
   public List listTables(Namespace namespace) {
-Preconditions.checkArgument(
-isValidateNamespace(namespace), "Missing database in namespace: %s", 
namespace);
-String database = namespace.level(0);
 
 try {
-  List<String> tableNames = clients.run(client -> client.getAllTables(database));
-  List<TableIdentifier> tableIdentifiers;
-
   if (listAllTables) {
-tableIdentifiers =
-tableNames.stream()
-.map(t -> TableIdentifier.of(namespace, t))
-.collect(Collectors.toList());
+return listContents(namespace, null, table -> true);
   } else {
-List<Table> tableObjects =
-clients.run(client -> client.getTableObjectsByName(database, tableNames));
-tableIdentifiers =
-tableObjects.stream()
-.filter(
-table ->
-table.getParameters() != null
-&& BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE
-.equalsIgnoreCase(
-table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)))
-.map(table -> TableIdentifier.of(namespace, table.getTableName()))
-.collect(Collectors.toList());
+return listContents(namespace, TableType.EXTERNAL_TABLE.name(), 
icebergPredicate());
   }
-
-  LOG.debug(
-  "Listing of namespace: {} resulted in the following tables: {}",
-  namespace,
-  tableIdentifiers);

Review Comment:
   Did we intentionally remove this log line?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383338484


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, 
TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+if (!isValidIdentifier(identifier)) {
+  return false;
+}

Review Comment:
   nit: newline



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383343370


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, 
TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+if (!isValidIdentifier(identifier)) {
+  return false;
+}
+try {
+  String database = identifier.namespace().level(0);
+  String viewName = identifier.name();
+  Table table = clients.run(client -> client.getTable(database, viewName));
+  HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, 
identifier));
+  clients.run(
+  client -> {
+client.dropTable(database, viewName);
+return null;
+  });
+  LOG.info("Dropped View: {}", identifier);
+  return true;
+
+} catch (NoSuchViewException | NoSuchObjectException e) {
+  LOG.info("Skipping drop, View does not exist: {}", identifier, e);
+  return false;
+} catch (TException e) {
+  throw new RuntimeException("Failed to drop " + identifier, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to dropView", e);
+}
+  }
+
+  @Override
+  public List<TableIdentifier> listViews(Namespace namespace) {
+try {
+  return listContents(namespace, TableType.VIRTUAL_VIEW.name(), 
icebergPredicate());
+} catch (UnknownDBException e) {
+  throw new NoSuchNamespaceException("Namespace does not exist: %s", 
namespace);
+
+} catch (TException e) {
+  throw new RuntimeException("Failed to list all views under namespace " + 
namespace, e);
+
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to listViews", e);
+}
+  }
+
+  private List<TableIdentifier> listContents(
+  Namespace namespace, String tableType, Predicate<Table> tablePredicate)
+  throws TException, InterruptedException {
+Preconditions.checkArgument(
+isValidateNamespace(namespace), "Missing database in namespace: %s", 
namespace);
+String database = namespace.level(0);
+List<String> tableNames =
+StringUtils.isNotEmpty(tableType)
+? clients.run(client -> client.getTables(database, "*", 
TableType.valueOf(tableType)))
+: clients.run(client -> client.getAllTables(database));
+List<Table> tableObjects =
+clients.run(client -> client.getTableObjectsByName(database, 
tableNames));
+List<TableIdentifier> tableIdentifiers =
+tableObjects.stream()
+.filter(tablePredicate)
+.map(table -> TableIdentifier.of(namespace, table.getTableName()))
+.collect(Collectors.toList());
+
+LOG.debug(
+"Listing of namespace: {} for table type {} resulted in the following: 
{}",
+namespace,
+tableType,
+tableIdentifiers);
+return tableIdentifiers;
+  }
+
+  private Predicate icebergPredicate() {
+return table ->
+table.getParameters() != null
+&& 
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.equalsIgnoreCase(
+
table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP));
+  }
+
+  @Override
+  @SuppressWarnings("FormatStringAnnotation")
+  public void renameView(TableIdentifier from, TableIdentifier originalTo) {
+
+if (!isValidIdentifier(from)) {
+  throw new NoSuchViewException("Invalid identifier: %s", from);
+}
+
+if (!namespaceExists(originalTo.namespace())) {
+  throw new NoSuchNamespaceException(
+  "Cannot rename %s to %s. Namespace does not exist: %s",
+  from, originalTo, originalTo.namespace());
+}
+
+TableIdentifier to = removeCatalogName(originalTo);
+Preconditions.checkArgument(isValidIdentifier(to), "Invalid identifier: 
%s", to);
+
+String toDatabase = to.namespace().level(0);
+String fromDatabase = from.namespace().level(0);
+String fromName = from.name();
+
+try {
+  Table fromView = clients.run(client -> client.getTable(fromDatabase, 
fromName));
+  HiveCatalogUtil.validateTableIsIcebergView(fromView, fullTableName(name, 
from));
+  if (tableExists(to)) {
+LOG.warn("Cannot rename view {} to {}. Table {} already exists.", 
from, to, to);
+throw new AlreadyExistsException(
+String.format("Cannot rename %s to %s. Table already exists", 
from, to));
+  }

Review Comment:
   nit: newline



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383344359


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, 
TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+if (!isValidIdentifier(identifier)) {
+  return false;
+}
+try {
+  String database = identifier.namespace().level(0);
+  String viewName = identifier.name();
+  Table table = clients.run(client -> client.getTable(database, viewName));
+  HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, 
identifier));
+  clients.run(
+  client -> {
+client.dropTable(database, viewName);
+return null;
+  });
+  LOG.info("Dropped View: {}", identifier);
+  return true;
+
+} catch (NoSuchViewException | NoSuchObjectException e) {
+  LOG.info("Skipping drop, View does not exist: {}", identifier, e);
+  return false;
+} catch (TException e) {
+  throw new RuntimeException("Failed to drop " + identifier, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to dropView", e);
+}
+  }
+
+  @Override
+  public List<TableIdentifier> listViews(Namespace namespace) {
+try {
+  return listContents(namespace, TableType.VIRTUAL_VIEW.name(), 
icebergPredicate());
+} catch (UnknownDBException e) {
+  throw new NoSuchNamespaceException("Namespace does not exist: %s", 
namespace);
+
+} catch (TException e) {
+  throw new RuntimeException("Failed to list all views under namespace " + 
namespace, e);
+
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to listViews", e);
+}
+  }
+
+  private List<TableIdentifier> listContents(
+  Namespace namespace, String tableType, Predicate<Table> tablePredicate)
+  throws TException, InterruptedException {
+Preconditions.checkArgument(
+isValidateNamespace(namespace), "Missing database in namespace: %s", 
namespace);
+String database = namespace.level(0);
+List<String> tableNames =
+StringUtils.isNotEmpty(tableType)
+? clients.run(client -> client.getTables(database, "*", 
TableType.valueOf(tableType)))
+: clients.run(client -> client.getAllTables(database));
+List<Table> tableObjects =
+clients.run(client -> client.getTableObjectsByName(database, 
tableNames));
+List<TableIdentifier> tableIdentifiers =
+tableObjects.stream()
+.filter(tablePredicate)
+.map(table -> TableIdentifier.of(namespace, table.getTableName()))
+.collect(Collectors.toList());
+
+LOG.debug(
+"Listing of namespace: {} for table type {} resulted in the following: 
{}",
+namespace,
+tableType,
+tableIdentifiers);
+return tableIdentifiers;
+  }
+
+  private Predicate icebergPredicate() {
+return table ->
+table.getParameters() != null
+&& 
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.equalsIgnoreCase(
+
table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP));
+  }
+
+  @Override
+  @SuppressWarnings("FormatStringAnnotation")
+  public void renameView(TableIdentifier from, TableIdentifier originalTo) {
+
+if (!isValidIdentifier(from)) {
+  throw new NoSuchViewException("Invalid identifier: %s", from);
+}
+
+if (!namespaceExists(originalTo.namespace())) {
+  throw new NoSuchNamespaceException(
+  "Cannot rename %s to %s. Namespace does not exist: %s",
+  from, originalTo, originalTo.namespace());
+}
+
+TableIdentifier to = removeCatalogName(originalTo);
+Preconditions.checkArgument(isValidIdentifier(to), "Invalid identifier: 
%s", to);
+
+String toDatabase = to.namespace().level(0);
+String fromDatabase = from.namespace().level(0);
+String fromName = from.name();
+
+try {
+  Table fromView = clients.run(client -> client.getTable(fromDatabase, 
fromName));
+  HiveCatalogUtil.validateTableIsIcebergView(fromView, fullTableName(name, 
from));
+  if (tableExists(to)) {
+LOG.warn("Cannot rename view {} to {}. Table {} already exists.", 
from, to, to);
+throw new AlreadyExistsException(
+String.format("Cannot rename %s to %s. Table already exists", 
from, to));
+  }
+  if (viewExists(to)) {

Review Comment:
   We are double-fetching the table from the HMS. Do we have a better solution?
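   
   One possible shape, as a sketch (not existing Iceberg code): answer both 
"does the target exist" and "is it a table or a view" from a single `getTable` 
call, treating `NoSuchObjectException` as absence, so no second lookup is 
needed.
   
   ```java
   import org.apache.hadoop.hive.metastore.IMetaStoreClient;
   import org.apache.hadoop.hive.metastore.TableType;
   import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
   import org.apache.hadoop.hive.metastore.api.Table;
   import org.apache.thrift.TException;
   
   class RenameSketch {
     // One HMS round trip answers both questions about the rename target.
     static String targetKind(IMetaStoreClient client, String db, String name) throws TException {
       try {
         Table target = client.getTable(db, name);
         return TableType.VIRTUAL_VIEW.name().equalsIgnoreCase(target.getTableType())
             ? "view"
             : "table";
       } catch (NoSuchObjectException e) {
         return "absent"; // safe to rename onto this identifier
       }
     }
   }
   ```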



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383346182


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogUtil.java:
##
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.List;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchIcebergTableException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** A utility class to validate Hive Iceberg Table and Views. */
+final class HiveCatalogUtil {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCatalogUtil.class);
+
+  // the max size is based on HMS backend database. For Hive versions below 
2.3, the max table
+  // parameter size is 4000
+  // characters, see https://issues.apache.org/jira/browse/HIVE-12274
+  // set to 0 to not expose Iceberg metadata in HMS Table properties.

Review Comment:
   I do not get this comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383350384


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogUtil.java:
##
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.List;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchIcebergTableException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** A utility class to validate Hive Iceberg Table and Views. */
+final class HiveCatalogUtil {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCatalogUtil.class);
+
+  // the max size is based on HMS backend database. For Hive versions below 
2.3, the max table
+  // parameter size is 4000
+  // characters, see https://issues.apache.org/jira/browse/HIVE-12274
+  // set to 0 to not expose Iceberg metadata in HMS Table properties.
+  static final String HIVE_TABLE_PROPERTY_MAX_SIZE = 
"iceberg.hive.table-property-max-size";
+  static final long HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT = 32672;
+
+  private HiveCatalogUtil() {
+// empty constructor for utility class
+  }
+
+  static boolean isTableWithTypeExists(
+  ClientPool<IMetaStoreClient, TException> clients,
+  TableIdentifier identifier,
+  TableType tableType) {
+String database = identifier.namespace().level(0);
+String tableName = identifier.name();
+try {
+  List<String> tables = clients.run(client -> client.getTables(database, tableName, tableType));
+  return !tables.isEmpty();
+} catch (TException e) {
+  throw new RuntimeException(
+  "Failed to check table existence " + database + "." + tableName, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to listTables", e);
+}
+  }
+
+  static void validateTableIsIcebergView(Table table, String fullName) {
+String tableType = 
table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP);
+NoSuchIcebergViewException.check(
+table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())
+&& tableType != null
+&& 
tableType.equalsIgnoreCase(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE),
+"Not an iceberg view: %s (type=%s) (tableType=%s)",
+fullName,
+tableType,
+table.getTableType());
+  }
+
+  static void validateTableIsIceberg(Table table, String fullName) {
+if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+  throw new AlreadyExistsException(
+  "View with same name already exists: %s.%s", table.getDbName(), 
table.getTableName());
+}

Review Comment:
   nit: newline



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383356758


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogUtil.java:
##
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.List;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchIcebergTableException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** A utility class to validate Hive Iceberg Table and Views. */
+final class HiveCatalogUtil {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCatalogUtil.class);
+
+  // the max size is based on HMS backend database. For Hive versions below 
2.3, the max table
+  // parameter size is 4000
+  // characters, see https://issues.apache.org/jira/browse/HIVE-12274
+  // set to 0 to not expose Iceberg metadata in HMS Table properties.
+  static final String HIVE_TABLE_PROPERTY_MAX_SIZE = 
"iceberg.hive.table-property-max-size";
+  static final long HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT = 32672;
+
+  private HiveCatalogUtil() {
+// empty constructor for utility class
+  }
+
+  static boolean isTableWithTypeExists(
+  ClientPool<IMetaStoreClient, TException> clients,
+  TableIdentifier identifier,
+  TableType tableType) {
+String database = identifier.namespace().level(0);
+String tableName = identifier.name();
+try {
+  List<String> tables = clients.run(client -> client.getTables(database, tableName, tableType));
+  return !tables.isEmpty();
+} catch (TException e) {
+  throw new RuntimeException(
+  "Failed to check table existence " + database + "." + tableName, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to listTables", e);
+}
+  }
+
+  static void validateTableIsIcebergView(Table table, String fullName) {
+String tableType = 
table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP);
+NoSuchIcebergViewException.check(
+table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())
+&& tableType != null
+&& 
tableType.equalsIgnoreCase(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE),
+"Not an iceberg view: %s (type=%s) (tableType=%s)",
+fullName,
+tableType,
+table.getTableType());
+  }
+
+  static void validateTableIsIceberg(Table table, String fullName) {
+if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+  throw new AlreadyExistsException(

Review Comment:
   This is a new type of exception that we are throwing here.
   We need to double-check that it is handled correctly everywhere.
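   
   To make the audit concrete, a hypothetical call site would now have to 
handle both outcomes (illustrative sketch, not existing code):
   
   ```java
   import org.apache.iceberg.exceptions.AlreadyExistsException;
   import org.apache.iceberg.exceptions.NoSuchIcebergTableException;
   
   class CallSiteSketch {
     // validateTableIsIceberg can now throw AlreadyExistsException as well,
     // so callers must handle both failure modes.
     static String classify(Runnable validate) {
       try {
         validate.run();
         return "iceberg table";
       } catch (AlreadyExistsException e) {
         return "name taken by a view";
       } catch (NoSuchIcebergTableException e) {
         return "not an iceberg table";
       }
     }
   }
   ```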



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383360341


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,389 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.SerDeInfo;
+import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.SchemaParser;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool<IMetaStoreClient, TException> metaClients;
+  private final long maxHiveTablePropertySize;
+  private final TableIdentifier identifier;
+
+  HiveViewOperations(
+  Configuration conf,
+  ClientPool<IMetaStoreClient, TException> metaClients,
+  FileIO fileIO,
+  String catalogName,
+  TableIdentifier viewIdentifier) {
+this.identifier = viewIdentifier;
+String dbName = viewIdentifier.namespace().level(0);
+this.metaClients = metaClients;
+this.fileIO = fileIO;
+this.fullName = catalogName + "." + dbName + "." + viewIdentifier.name();
+this.database = dbName;
+this.viewName = viewIdentifier.name();
+this.maxHiveTablePropertySize =
+conf.getLong(
+HiveCatalogUtil.HIVE_TABLE_PROPERTY_MAX_SIZE,
+HiveCatalogUtil.HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public ViewMetadata current() {
+if (HiveCatalogUtil.isTableWithTypeExists(metaClients, identifier, 
TableType.EXTERNAL_TABLE)) {
+  throw new AlreadyExistsException(
+  "Table with same name already exists: %s.%s", database, viewName);
+}
+return super.current();
+  }
+
+  @Override
+  public void doRefresh() {
+String metadataLocation = null;
+try {
+  Table table = metaClients.run(client -> client.getTable(database, 
viewName));
+  HiveCatalogUtil.validateTableIsIcebergView(table, fullName);
+  metadataLocation =
+  
table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+} catch (NoSuchObjectException e) {
+  if (currentMetadataLocation() != null) {
+throw new NoSuchViewException("View does not exist: %s.%s", database, 
viewName);
+  }
+} catch (TException e) {
+  String errMsg =
+  String.format("Failed to get view info from metastore %s.%s", 
database,

Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383362391


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,389 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.SerDeInfo;
+import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.SchemaParser;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations {

Review Comment:
   How big is the duplicated code here?
   Do we want to have a common ancestor for `HiveViewOperations` and 
`HiveTableOperations`?
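   
   A rough shape for that idea (all names hypothetical, not existing Iceberg 
code): hoist the shared HMS plumbing into a small base class and keep the 
table/view specifics in the subclasses.
   
   ```java
   import org.apache.hadoop.hive.metastore.IMetaStoreClient;
   import org.apache.iceberg.ClientPool;
   import org.apache.thrift.TException;
   
   // Hypothetical common ancestor for HiveTableOperations and HiveViewOperations.
   abstract class BaseHiveOperations {
     protected final ClientPool<IMetaStoreClient, TException> metaClients;
     protected final String database;
     protected final String name;
     protected final long maxHiveTablePropertySize;
   
     BaseHiveOperations(
         ClientPool<IMetaStoreClient, TException> metaClients,
         String database,
         String name,
         long maxHiveTablePropertySize) {
       this.metaClients = metaClients;
       this.database = database;
       this.name = name;
       this.maxHiveTablePropertySize = maxHiveTablePropertySize;
     }
   
     // Table- vs. view-specific validation stays in the subclasses.
     protected abstract void validateExisting(org.apache.hadoop.hive.metastore.api.Table tbl);
   }
   ```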



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-11-06 Thread via GitHub


ajantha-bhat commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1383363198


##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -400,8 +400,15 @@ public void replaceTableViaTransactionThatAlreadyExistsAsView() {
 .buildTable(viewIdentifier, SCHEMA)
 .replaceTransaction()
 .commitTransaction())
-.isInstanceOf(NoSuchTableException.class)
-.hasMessageStartingWith("Table does not exist: ns.view");
+.satisfiesAnyOf(
+throwable ->
+assertThat(throwable)
+.isInstanceOf(NoSuchTableException.class)
+.hasMessageStartingWith("Table does not exist: ns.view"),
+throwable ->
+assertThat(throwable)

Review Comment:
   
   In `replaceTableViaTransactionThatAlreadyExistsAsView`, 
`NessieTableOperations.doRefresh()` throws `AlreadyExistsException`, but the 
test expects `NoSuchViewException`.
   
   If I fix it, another test case 
(`createOrReplaceTableViaTransactionThatAlreadyExistsAsView`) fails, because 
we expect two different kinds of exceptions from the same place, 
`doRefresh()`. I think the test cases need to be modified instead of unifying 
the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383401740


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new RuntimeException("Interrupted in call to rename", e);
+} catch (RuntimeException e) {

Review Comment:
   Taking a quick look, I do not like this change.
   
   Why is this `RuntimeException`?
   
   This code tries to capture this case:
   ```
   } catch (AlreadyExistsException e) {
 throw new org.apache.iceberg.exceptions.AlreadyExistsException(
 "Table already exists: %s", to);
   }
   ```
   
   Do we have checks for the other cases as well?
   My first guess would be that `MetastoreUtil.alterTable` messes up the Hive 
exceptions, and maybe all of them are off?
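   
   Spelled out, the narrower catch would look roughly like this (a sketch that 
assumes HMS surfaces `org.apache.hadoop.hive.metastore.api.AlreadyExistsException` 
directly rather than a wrapped `RuntimeException`):
   
   ```java
   import org.apache.hadoop.hive.metastore.api.AlreadyExistsException;
   import org.apache.iceberg.catalog.TableIdentifier;
   
   class RenameCatchSketch {
     interface HmsRename {
       void run() throws AlreadyExistsException;
     }
   
     // Translate the specific HMS exception instead of matching on
     // RuntimeException, which can also swallow unrelated failures.
     static void rename(HmsRename renameCall, TableIdentifier to) {
       try {
         renameCall.run();
       } catch (AlreadyExistsException e) {
         throw new org.apache.iceberg.exceptions.AlreadyExistsException(
             "Table already exists: %s", to);
       }
     }
   }
   ```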



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383403517


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new RuntimeException("Interrupted in call to rename", e);
+} catch (RuntimeException e) {

Review Comment:
   Also maybe the exceptions thrown are different with different Hive versions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383411335


##
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreTest.java:
##
@@ -31,6 +31,10 @@
 import org.junit.jupiter.api.AfterAll;
 import org.junit.jupiter.api.BeforeAll;
 
+/*
+ * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} 
instead.
+ * */

Review Comment:
   nit: `* */`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383409990


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -500,6 +511,9 @@ protected String defaultWarehouseLocation(TableIdentifier 
tableIdentifier) {
 return String.format("%s/%s", databaseData.getLocationUri(), 
tableIdentifier.name());
   }
 
+} catch (NoSuchObjectException e) {
+  throw new NoSuchNamespaceException(
+  e, "Namespace does not exist: %s", 
tableIdentifier.namespace().levels()[0]);

Review Comment:
   Just to double-check: the exceptions might be different with Hive2 and Hive3.
   Do we run the tests with both Hive2 and Hive3 on CI?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383413298


##
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreTest.java:
##
@@ -31,6 +31,10 @@
 import org.junit.jupiter.api.AfterAll;
 import org.junit.jupiter.api.BeforeAll;
 
+/*
+ * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} 
instead.
+ * */
+@Deprecated

Review Comment:
   After this could you please create a PR to remove this class?
   Or in this one, if it is trivial...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] docs(readme): Add feature roadmap and support to readme [iceberg-go]

2023-11-06 Thread via GitHub


zeroshade opened a new pull request, #32:
URL: https://github.com/apache/iceberg-go/pull/32

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs(readme): Add feature roadmap and support to readme [iceberg-go]

2023-11-06 Thread via GitHub


zeroshade commented on PR #32:
URL: https://github.com/apache/iceberg-go/pull/32#issuecomment-1794974142

   CC @nastra @Fokko @rdblue @coded9 @bitsondatadev @wolfeidau 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Cannot decode dictionary of type INT96 when reading imported Spark parquet table [iceberg]

2023-11-06 Thread via GitHub


manuzhang opened a new issue, #8990:
URL: https://github.com/apache/iceberg/issues/8990

   ### Apache Iceberg version
   
   1.2.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   The following exception was thrown when reading an imported Spark parquet 
table with filter.
   
   ```
   java.lang.IllegalArgumentException: Cannot decode dictionary of type: INT96
   at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.dict(ParquetDictionaryRowGroupFilter.java:458)
   at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.eq(ParquetDictionaryRowGroupFilter.java:293)
   at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.eq(ParquetDictionaryRowGroupFilter.java:80)
   at org.apache.iceberg.expressions.ExpressionVisitors$BoundExpressionVisitor.predicate(ExpressionVisitors.java:162)
   at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390)
   at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409)
   at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.eval(ParquetDictionaryRowGroupFilter.java:118)
   at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.access$100(ParquetDictionaryRowGroupFilter.java:80)
   at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter.shouldRead(ParquetDictionaryRowGroupFilter.java:74)
   at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:119)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs(readme): Add feature roadmap and support to readme [iceberg-go]

2023-11-06 Thread via GitHub


nastra merged PR #32:
URL: https://github.com/apache/iceberg-go/pull/32


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] support partition spec update in pyiceberg [iceberg-python]

2023-11-06 Thread via GitHub


puchengy commented on issue #124:
URL: https://github.com/apache/iceberg-python/issues/124#issuecomment-1795046924

   @Fokko Hi, we are interested in Hive catalog and rest catalog in future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]

2023-11-06 Thread via GitHub


nastra commented on code in PR #8980:
URL: https://github.com/apache/iceberg/pull/8980#discussion_r1383492951


##
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java:
##
@@ -497,6 +500,67 @@ public void 
testReadStreamWithSnapshotTypeOverwriteErrorsOut() throws Exception
 .hasMessageStartingWith("Cannot process overwrite snapshot");
   }
 
+  @Test
+  public void testReadStreamWithSnapshotTypeRewriteDataFilesIgnoresReplace() 
throws Exception {
+// fill table with some data
+List<List<SimpleRecord>> expected = TEST_DATA_MULTIPLE_SNAPSHOTS;
+appendDataAsMultipleSnapshots(expected);
+
+makeRewriteDataFiles();
+
+Iterable<Snapshot> snapshots = table.snapshots();
+for (Snapshot s : snapshots) {
+  System.out.println(s.snapshotId());
+}
+
+Assert.assertEquals(
+6,
+microBatchCount(
+
ImmutableMap.of(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1")));
+  }
+
+  @Test
+  public void testReadStreamWithSnapshotType2RewriteDataFilesIgnoresReplace() 
throws Exception {
+// fill table with some data
+List<List<SimpleRecord>> expected = TEST_DATA_MULTIPLE_SNAPSHOTS;
+appendDataAsMultipleSnapshots(expected);
+
+makeRewriteDataFiles();
+makeRewriteDataFiles();
+
+Iterable<Snapshot> snapshots = table.snapshots();
+for (Snapshot s : snapshots) {
+  System.out.println(s.snapshotId());
+}
+
+Assert.assertEquals(
+6,
+microBatchCount(
+
ImmutableMap.of(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1")));
+  }
+
+  @Test
+  public void 
testReadStreamWithSnapshotTypeRewriteDataFilesIgnoresReplaceFollowedByAppend()
+  throws Exception {
+// fill table with some data
+List<List<SimpleRecord>> expected = TEST_DATA_MULTIPLE_SNAPSHOTS;
+appendDataAsMultipleSnapshots(expected);
+
+makeRewriteDataFiles();
+
+appendDataAsMultipleSnapshots(expected);
+
+Iterable<Snapshot> snapshots = table.snapshots();
+for (Snapshot s : snapshots) {
+  System.out.println(s.snapshotId());
+}
+
+Assert.assertEquals(

Review Comment:
   please use AssertJ-style assertions for newly written test code: 
https://iceberg.apache.org/contribute/#assertj
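   For illustration, the JUnit 4 assertion in the diff could be rewritten in the requested AssertJ style roughly like this (a sketch; `microBatchCount` and the read option come from the test code above):
   
   ```java
   import static org.assertj.core.api.Assertions.assertThat;
   
   // AssertJ-style equivalent of the Assert.assertEquals(6, ...) call in the diff
   assertThat(
           microBatchCount(
               ImmutableMap.of(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1")))
       .isEqualTo(6);
   ```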



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-11-06 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1383589051


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/AvroUtil.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.PartitionData;
+import org.apache.iceberg.avro.AvroEncoderUtil;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.data.avro.DecoderResolver;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+
+/** Class for Avro-related utility methods. */
+class AvroUtil {
+  static final Map<Integer, String> FIELD_ID_TO_CLASS =
+  ImmutableMap.of(
+  DataComplete.ASSIGNMENTS_ELEMENT,
+  TopicPartitionOffset.class.getName(),
+  DataFile.PARTITION_ID,
+  PartitionData.class.getName(),
+  DataWritten.TABLE_REFERENCE,
+  TableReference.class.getName(),
+  DataWritten.DATA_FILES_ELEMENT,
+  "org.apache.iceberg.GenericDataFile",
+  DataWritten.DELETE_FILES_ELEMENT,
+  "org.apache.iceberg.GenericDeleteFile",
+  CommitToTable.TABLE_REFERENCE,
+  TableReference.class.getName());
+
+  public static byte[] encode(Event event) {
+try {
+  return AvroEncoderUtil.encode(event, event.getSchema());
+} catch (IOException e) {
+  throw new UncheckedIOException(e);
+}
+  }
+
+  public static Event decode(byte[] bytes) {
+try {
+  Event event = AvroEncoderUtil.decode(bytes);
+  // clear the cache to avoid memory leak
+  DecoderResolver.clearCache();
+  return event;
+} catch (IOException e) {
+  throw new UncheckedIOException(e);
+}
+  }
+
+  static Schema convert(Types.StructType icebergSchema, Class<?> javaClass) {
+return convert(icebergSchema, javaClass, FIELD_ID_TO_CLASS);
+  }
+
+  static Schema convert(
+  Types.StructType icebergSchema,
+  Class<?> javaClass,
+  Map<Integer, String> typeMap) {
+return AvroSchemaUtil.convert(
+icebergSchema,
+(fieldId, struct) ->
+struct.equals(icebergSchema) ? javaClass.getName() : 
typeMap.get(fieldId));
+  }
+
+  static int positionToId(int position, Schema avroSchema) {
+List<Schema.Field> fields = avroSchema.getFields();
+Preconditions.checkArgument(position < fields.size(), "Invalid field 
position: " + position);

Review Comment:
   yes, I added this check also, thanks
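   For reference, a minimal sketch of the combined guard (the exact message wording here is an assumption, not the merged code):
   
   ```java
   // reject negative positions as well as positions past the end of the field list
   Preconditions.checkArgument(
       position >= 0 && position < fields.size(), "Invalid field position: %s", position);
   ```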



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Docs: Update site-docs/spark-quickstart.md [iceberg]

2023-11-06 Thread via GitHub


stavdav143 opened a new pull request, #8991:
URL: https://github.com/apache/iceberg/pull/8991

   Local volume with warehouse/notebooks to be mounted on the Minio service instead of Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383682090


##
api/src/main/java/org/apache/iceberg/Scan.java:
##
@@ -77,6 +78,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>> {
*/
   ThisT includeColumnStats();
 
+  /**
+   * Create a new scan from this that loads the column stats for the specific 
columns with each data
+   * file. If the columns set is empty or null then all column 
stats will be kept, if
+   * {@link #includeColumnStats()} is set.
+   *
+   * Column stats include: value count, null value count, lower bounds, and 
upper bounds.
+   *
+   * @param columnsToKeepStats column ids from the table's schema
+   * @return a new scan based on this that loads column stats for specific 
columns.
+   */
+  default ThisT columnsToKeepStats(Set<Integer> columnsToKeepStats) {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383682326


##
core/src/main/java/org/apache/iceberg/GenericDataFile.java:
##
@@ -66,23 +68,31 @@ class GenericDataFile extends BaseFile implements 
DataFile {
* Copy constructor.
*
* @param toCopy a generic data file to copy.
-   * @param fullCopy whether to copy all fields or to drop column-level stats
+   * @param copyStats whether to copy all fields or to drop column-level stats.
+   * @param columnsToKeepStats a set of column ids to keep stats. If empty or 
null then
+   * every column stat is kept.
*/
-  private GenericDataFile(GenericDataFile toCopy, boolean fullCopy) {
-super(toCopy, fullCopy);
+  private GenericDataFile(
+  GenericDataFile toCopy, boolean copyStats, Set<Integer> columnsToKeepStats) {

Review Comment:
   Done



##
core/src/main/java/org/apache/iceberg/GenericDataFile.java:
##
@@ -66,23 +68,31 @@ class GenericDataFile extends BaseFile implements 
DataFile {
* Copy constructor.
*
* @param toCopy a generic data file to copy.
-   * @param fullCopy whether to copy all fields or to drop column-level stats
+   * @param copyStats whether to copy all fields or to drop column-level stats.
+   * @param columnsToKeepStats a set of column ids to keep stats. If empty or 
null then
+   * every column stat is kept.
*/
-  private GenericDataFile(GenericDataFile toCopy, boolean fullCopy) {
-super(toCopy, fullCopy);
+  private GenericDataFile(
+  GenericDataFile toCopy, boolean copyStats, Set<Integer> columnsToKeepStats) {
+super(toCopy, copyStats, columnsToKeepStats);
   }
 
   /** Constructor for Java serialization. */
   GenericDataFile() {}
 
   @Override
   public DataFile copyWithoutStats() {
-return new GenericDataFile(this, false /* drop stats */);
+return new GenericDataFile(this, false /* drop stats */, 
ImmutableSet.of());
+  }
+
+  @Override
+  public DataFile copyWithStats(Set<Integer> columnsToKeepStats) {
+return new GenericDataFile(this, true, columnsToKeepStats);
   }
 
   @Override
   public DataFile copy() {
-return new GenericDataFile(this, true /* full copy */);
+return new GenericDataFile(this, true /* full copy */, ImmutableSet.of());

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383682641


##
api/src/main/java/org/apache/iceberg/Scan.java:
##
@@ -77,6 +78,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>> {
*/
   ThisT includeColumnStats();
 
+  /**
+   * Create a new scan from this that loads the column stats for the specific 
columns with each data
+   * file. If the columns set is empty or null then all column 
stats will be kept, if
+   * {@link #includeColumnStats()} is set.
+   *
+   * Column stats include: value count, null value count, lower bounds, and 
upper bounds.
+   *
+   * @param columnsToKeepStats column ids from the table's schema

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-11-06 Thread via GitHub


pvary commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383715528


##
core/src/main/java/org/apache/iceberg/util/ContentFileUtil.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.util;
+
+import java.util.Set;
+import org.apache.iceberg.ContentFile;
+
+public class ContentFileUtil {
+  private ContentFileUtil() {}
+
+  /**
+   * Copies the {@link ContentFile} with the specific stat settings.
+   *
+   * @param file a generic data file to copy.
+   * @param withStats whether to keep any stats
+   * @param columnsToKeepStats a set of column ids to keep stats. If empty or 
null then
+   * every column stat is kept.

Review Comment:
   I converted the `Scan` to use the column names instead of the column ids, like: `Scan.includeColumnStats(Collection<String> requestedColumns)`
   
   For the `ContentFile` we do not have the `Schema` at hand, so I had to stick to the `ContentFile.copyWithStats(Set<Integer> requestedColumnIds)`
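   Put together, the two levels described above might be used like this (a sketch based on the names in this thread; the exact signatures are assumptions, not the merged API):
   
   ```java
   // Scan level: request stats only for the named columns.
   TableScan scan = table.newScan().includeColumnStats(ImmutableSet.of("id", "event_ts"));
   
   // File level: copy a ContentFile, keeping stats only for the given field ids.
   DataFile filtered = dataFile.copyWithStats(ImmutableSet.of(1, 5));
   ```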



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement pre-existing session support for dynamodb catalog [iceberg-python]

2023-11-06 Thread via GitHub


waifairer commented on code in PR #104:
URL: https://github.com/apache/iceberg-python/pull/104#discussion_r1383886741


##
mkdocs/docs/configuration.md:
##
@@ -195,6 +195,19 @@ catalog:
 table-name: iceberg
 ```
 
+If you prefer to pass the credentials explicitly to the client instead of 
relying on environment variables,
+
+```yaml
+catalog:
+  default:
+type: dynamodb
+table-name: iceberg

Review Comment:
   @HonahX Definitely agreed with `dynamo` as a prefix. 
   
   As for hyphens vs underscores, AWS is _really_ consistent about using 
underscores. I'm of the opinion that the AWS-based credentials should support 
both underscores and hyphens, prefer hyphens when present, and fall back to 
the underscore forms if necessary. Documentation should only present the 
hyphenated case as an option. I believe this strategy would lead to the 
fewest "head banging" debug sessions. However, I think a reasonable 
case could be made to remove underscore support entirely instead of supporting 
it as a fallback. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Adding new columns (mergeSchema) [iceberg]

2023-11-06 Thread via GitHub


FabricioZGalvani commented on issue #8908:
URL: https://github.com/apache/iceberg/issues/8908#issuecomment-1796301185

   Hello everyone,
   
   After several attempts, I managed to solve the mergeSchema issue I was 
facing. The solution was to apply the following configuration. I suspect that 
the check-ordering is set to true by default, which can cause errors when 
columns are out of order. This happens even when following the documentation's 
instructions. Without applying the configuration below, I was receiving the 
'column is out of order' error.
   
   `--conf spark.sql.iceberg.check-ordering=false`
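   For example, the same flag can presumably also be set programmatically when building the session (a sketch; the property name is taken from the comment above, not from verified documentation):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   // Equivalent of passing --conf on the command line
   SparkSession spark =
       SparkSession.builder()
           .appName("iceberg-merge-schema")  // hypothetical app name
           .config("spark.sql.iceberg.check-ordering", "false")
           .getOrCreate();
   ```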


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]

2023-11-06 Thread via GitHub


emkornfield commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1384024269


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   The definition of null values present is already explained in the paragraph 
starting on line 
[850](https://github.com/apache/iceberg/pull/8981/files#diff-36347a47c3bf67ea2ef6309ea96201814032d21bb5f162dfae4045508c15588aR850),
 please let me know if you think this is insufficient, or if there is some 
wordsmithing needed (I'm open to suggestions).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]

2023-11-06 Thread via GitHub


emkornfield commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1384024269


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   The definition of null values present is already explained in the paragraph 
starting on line 850, please let me know if you think this is insufficient, 
or if there is some wordsmithing needed (I'm open to suggestions).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] [wip] de-dup props [iceberg]

2023-11-06 Thread via GitHub


thomaschow opened a new pull request, #8992:
URL: https://github.com/apache/iceberg/pull/8992

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Build: Bump mkdocs-material from 9.4.7 to 9.4.8 [iceberg-python]

2023-11-06 Thread via GitHub


dependabot[bot] opened a new pull request, #131:
URL: https://github.com/apache/iceberg-python/pull/131

   Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 
9.4.7 to 9.4.8.
   
   Release notes
   Sourced from mkdocs-material's releases (https://github.com/squidfunk/mkdocs-material/releases):
   
   mkdocs-material-9.4.8
   - Fixed invalid local address replacement when using instant loading
   - Fixed #6275: Crash after navigation caused 404 when using instant loading
   
   Changelog
   Sourced from mkdocs-material's changelog (https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG):
   
   mkdocs-material-9.4.8+insiders-4.43.0 (2023-11-05)
   - Added support for GitLab committers (document contributors)
   - Fixed #6264: Fixed compatibility with Python < 3.10
   - Fixed #6254: Meta plugin not applying meta files to blog posts
   
   mkdocs-material-9.4.8 (2023-11-05)
   - Fixed invalid local address replacement when using instant loading
   - Fixed #6275: Crash after navigation caused 404 when using instant loading
   
   mkdocs-material-9.4.7+insiders-4.42.3 (2023-10-27)
   - Fixed #6251: Cards in grids cut off on very small screens
   - Fixed #6241: Using social plugin + static-i18n plugin errors
   
   mkdocs-material-9.4.7 (2023-10-27)
   - Added Azerbaijani translations
   
   mkdocs-material-9.4.6+insiders-4.42.2 (2023-10-14)
   - Fixed #6186: Privacy plugin ignores hash fragments on images
   - Fixed #6180: Projects plugin crashing when adding or removing files
   
   mkdocs-material-9.4.6 (2023-10-14)
   - Updated Danish and Norwegian (Nynorsk) translations
   - Fixed #6169: Blog post metadata layout overflows on small screens
   
   mkdocs-material-9.4.5 (2023-10-10)
   - Fixed sidebar auto-positioning (9.4.2 regression)
   - Fixed #6166: Improve group plugin compatibility with Python < 3.10
   - Fixed #6157: Hiding tags does not work (9.4.3 regression)
   
   mkdocs-material-9.4.4+insiders-4.42.1 (2023-10-05)
   - Fixed spacing of related links in blog posts on small screens
   
   mkdocs-material-9.4.4 (2023-10-05)
   - Added support for overriding text to be copied for code blocks
   - Fixed broken layout in some browsers at breakpoints when using zoom
   - Fixed #6132: Incomplete search highlighting for code blocks in titles
   
   mkdocs-material-9.4.3 (2023-10-02)
   - Added support for instant navigation progress indicator
   - Improved spacing and alignment of tags
   
   ... (truncated)
   
   Commits
   - c0755bf2471a3476d9592de99384abe476c4a645 Prepare 9.4.8 release
   - fabc9bd6b8c50001b2e2a3d9b8ee9b1fd53c5cfe Documentation
   - 87d69a95b8284d60233d94563953ecf7b1dfc7ec Fixed invalid local address when using instant loading
   - 9a7a185f9d1d7cc4b47609846deeea26c04bd3b2 Documentation (#6267)
   - 7353c7d7cf862dec278d115bd4dbd892952d4111 Documentation (#6277)
   - ca5f5174a312fbf80e1e1ad275d84dbebdabf4cd Merge branch 'master' of github.com:squidfunk/mkdocs-material
   - 494cae1e36664a5d106c1371b05e74f90703b919 Fixed crash after navigation caused 404 when using instant loading
   - 1698708b2329980453da332a2aafa39b0f654653 Documentation (#6260)
   - 551d98e6de12e80e6da734bb3ff2dbc85d4adf5b Documentation (#6222)
   - dfa5f0313893ff7fc254a8d74421735d5f1d3e

Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384122853


##
api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java:
##
@@ -589,17 +603,17 @@ private static String sanitizeNumber(Number value, String 
type) {
 return "(" + numDigits + "-digit-" + type + ")";
   }
 
-  private static String sanitizeString(CharSequence value, long now, int 
today) {
+  private static String sanitizeString(CharSequence value, long nowMillis, int 
today) {
 try {
   if (DATE.matcher(value).matches()) {
 Literal<Integer> date = Literal.of(value).to(Types.DateType.get());
 return sanitizeDate(date.value(), today);
   } else if (TIMESTAMP.matcher(value).matches()) {
-Literal<Long> ts = Literal.of(value).to(Types.TimestampType.withoutZone());
-return sanitizeTimestamp(ts.value(), now);
+Literal<Long> ts = Literal.of(value).to(Types.TimestampType.nanosWithoutZone());

Review Comment:
   Resolved in 1a6cf52c1baa35e1fdd086d8907fbed7873cf317



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384147958


##
api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java:
##
@@ -121,17 +121,13 @@ static <R> R visit(Schema schema, PartitionField field, PartitionSpecVisitor<R> visitor) {
 } else if (transform instanceof Truncate) {
   int width = ((Truncate) transform).width();
   return visitor.truncate(field.fieldId(), sourceName, field.sourceId(), 
width);
-} else if (transform == Dates.YEAR
-|| transform == Timestamps.YEAR
-|| transform instanceof Years) {
+} else if ("year".equalsIgnoreCase(transform.toString())) {

Review Comment:
   You're right. Fixed in 09c1f2534
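   A sketch of the direction such a fix could take, matching on the transform's type and result unit instead of its string form (`getResultTypeUnit` appears elsewhere in this PR's diff; the visitor call shape is an assumption, and the merged code may differ):
   
   ```java
   } else if (transform instanceof Years
       || transform == Dates.YEAR
       || (transform instanceof Timestamps
           && ((Timestamps) transform).getResultTypeUnit() == ChronoUnit.YEARS)) {
     return visitor.year(field.fieldId(), sourceName, field.sourceId());
   }
   ```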



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384122853


##
api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java:
##
@@ -589,17 +603,17 @@ private static String sanitizeNumber(Number value, String 
type) {
 return "(" + numDigits + "-digit-" + type + ")";
   }
 
-  private static String sanitizeString(CharSequence value, long now, int 
today) {
+  private static String sanitizeString(CharSequence value, long nowMillis, int 
today) {
 try {
   if (DATE.matcher(value).matches()) {
 Literal<Integer> date = Literal.of(value).to(Types.DateType.get());
 return sanitizeDate(date.value(), today);
   } else if (TIMESTAMP.matcher(value).matches()) {
-Literal<Long> ts = Literal.of(value).to(Types.TimestampType.withoutZone());
-return sanitizeTimestamp(ts.value(), now);
+Literal<Long> ts = Literal.of(value).to(Types.TimestampType.nanosWithoutZone());

Review Comment:
   Resolved in 1f95ceb31



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382165939


##
api/src/main/java/org/apache/iceberg/transforms/Days.java:
##
@@ -55,14 +56,14 @@ public boolean satisfiesOrderOf(Transform<?, ?> other) {
 }
 
 if (other instanceof Timestamps) {
-  return Timestamps.DAY.satisfiesOrderOf(other);
+  return ((Timestamps) other).getResultTypeUnit() == ChronoUnit.DAYS

Review Comment:
   Fixed in 3b87435b3



##
api/src/main/java/org/apache/iceberg/transforms/Hours.java:
##
@@ -57,15 +58,16 @@ public boolean satisfiesOrderOf(Transform<?, ?> other) {
 }
 
 if (other instanceof Timestamps) {
-  return other == Timestamps.HOUR;
-} else if (other instanceof Hours
-|| other instanceof Days
-|| other instanceof Months
-|| other instanceof Years) {
-  return true;
+  return ((Timestamps) other).getResultTypeUnit() == ChronoUnit.HOURS

Review Comment:
   Fixed in 3b87435b3



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382166328


##
api/src/main/java/org/apache/iceberg/transforms/Months.java:
##
@@ -55,14 +57,13 @@ public boolean satisfiesOrderOf(Transform<?, ?> other) {
 }
 
 if (other instanceof Timestamps) {
-  return Timestamps.MONTH.satisfiesOrderOf(other);
+  return ((Timestamps) other).getResultTypeUnit() == ChronoUnit.MONTHS

Review Comment:
   Fixed in 3b87435b3



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384147958


##
api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java:
##
@@ -121,17 +121,13 @@ static <R> R visit(Schema schema, PartitionField field, PartitionSpecVisitor<R> visitor) {
 } else if (transform instanceof Truncate) {
   int width = ((Truncate) transform).width();
   return visitor.truncate(field.fieldId(), sourceName, field.sourceId(), 
width);
-} else if (transform == Dates.YEAR
-|| transform == Timestamps.YEAR
-|| transform instanceof Years) {
+} else if ("year".equalsIgnoreCase(transform.toString())) {

Review Comment:
   You're right. Fixed in 
[09c1f2534](https://github.com/apache/iceberg/pull/8971/commits/09c1f253424077e93c668fc913a9e645c0607568)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382170673


##
api/src/main/java/org/apache/iceberg/types/Types.java:
##
@@ -205,27 +208,56 @@ public String toString() {
   }
 
   public static class TimestampType extends PrimitiveType {
-private static final TimestampType INSTANCE_WITH_ZONE = new 
TimestampType(true);
-private static final TimestampType INSTANCE_WITHOUT_ZONE = new 
TimestampType(false);
+
+private static final TimestampType INSTANCE_MICROS_WITH_ZONE =
+new TimestampType(true, ChronoUnit.MICROS);
+private static final TimestampType INSTANCE_MICROS_WITHOUT_ZONE =
+new TimestampType(false, ChronoUnit.MICROS);
+private static final TimestampType INSTANCE_NANOS_WITH_ZONE =
+new TimestampType(true, ChronoUnit.NANOS);
+private static final TimestampType INSTANCE_NANOS_WITHOUT_ZONE =
+new TimestampType(false, ChronoUnit.NANOS);
 
 public static TimestampType withZone() {

Review Comment:
   Fixed in 5f30948fa



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382167692


##
api/src/main/java/org/apache/iceberg/transforms/Transforms.java:
##
@@ -129,10 +131,14 @@ public static <T> Transform<T, Integer> year(Type type) {
   case DATE:
 return (Transform<T, Integer>) Dates.YEAR;
   case TIMESTAMP:
-return (Transform<T, Integer>) Timestamps.YEAR;
-  default:
-throw new IllegalArgumentException("Cannot partition type " + type + " by year");
+switch (((TimestampType) type).unit()) {
+  case MICROS:
+return (Transform<T, Integer>) Timestamps.YEAR_FROM_MICROS;
+  case NANOS:
+return (Transform<T, Integer>) Timestamps.YEAR_FROM_NANOS;
+}
 }
+throw new IllegalArgumentException("Cannot partition type " + type + " by year");

Review Comment:
   I agree. Fixed in 74b90d9a5
   
   As I fixed this, I noticed that exceptions thrown for "unsupported timestamp 
unit" were inconsistent, so I cleaned that up in e0f6d3b1f



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-06 Thread via GitHub


jacobmarble commented on code in PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384147958


##
api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java:
##
@@ -121,17 +121,13 @@ static <R> R visit(Schema schema, PartitionField field, PartitionSpecVisitor<R> visitor) {
 } else if (transform instanceof Truncate) {
   int width = ((Truncate) transform).width();
   return visitor.truncate(field.fieldId(), sourceName, field.sourceId(), 
width);
-} else if (transform == Dates.YEAR
-|| transform == Timestamps.YEAR
-|| transform instanceof Years) {
+} else if ("year".equalsIgnoreCase(transform.toString())) {

Review Comment:
   You're right. Fixed in 1e374c62e



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-11-06 Thread via GitHub


dimas-b commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1384165960


##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -400,8 +400,15 @@ public void 
replaceTableViaTransactionThatAlreadyExistsAsView() {
 .buildTable(viewIdentifier, SCHEMA)
 .replaceTransaction()
 .commitTransaction())
-.isInstanceOf(NoSuchTableException.class)
-.hasMessageStartingWith("Table does not exist: ns.view");
+.satisfiesAnyOf(
+throwable ->
+assertThat(throwable)
+.isInstanceOf(NoSuchTableException.class)
+.hasMessageStartingWith("Table does not exist: ns.view"),
+throwable ->
+assertThat(throwable)

Review Comment:
   Can we follow the in-memory catalog pattern in Nessie?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-11-06 Thread via GitHub


dimas-b commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1384188720


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java:
##
@@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata 
metadata) {
 boolean newTable = base == null;
 String newMetadataLocation = writeNewMetadataIfRequired(newTable, 
metadata);
 
-String refName = client.refName();
-boolean failure = false;
+AtomicBoolean failure = new AtomicBoolean(false);
 try {
   String contentId = table == null ? null : table.getId();
   client.commitTable(base, metadata, newMetadataLocation, contentId, key);
-} catch (NessieConflictException ex) {
-  failure = true;
-  if (ex instanceof NessieReferenceConflictException) {
-// Throws a specialized exception, if possible
-maybeThrowSpecializedException((NessieReferenceConflictException) ex);
+} catch (NessieConflictException | NessieNotFoundException | 
HttpClientException ex) {
+  NessieUtil.handleExceptionsForCommits(ex, client.refName(), failure);
+} catch (NessieBadRequestException ex) {

Review Comment:
   I think it's a good idea to record the current commit hash for "new" 
table/view commits at the time `NessieTableOperations` is created. At the same 
time, if the commit is an update, I think we should use the hash from the 
metadata (basically track it for views too). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Spark: inconsistency in rewrite data and summary [iceberg]

2023-11-06 Thread via GitHub


github-actions[bot] commented on issue #7463:
URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1797064259

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-11-06 Thread via GitHub


dimas-b commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1384194329


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java:
##
@@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata 
metadata) {
 boolean newTable = base == null;
 String newMetadataLocation = writeNewMetadataIfRequired(newTable, 
metadata);
 
-String refName = client.refName();
-boolean failure = false;
+AtomicBoolean failure = new AtomicBoolean(false);
 try {
   String contentId = table == null ? null : table.getId();
   client.commitTable(base, metadata, newMetadataLocation, contentId, key);
-} catch (NessieConflictException ex) {
-  failure = true;
-  if (ex instanceof NessieReferenceConflictException) {
-// Throws a specialized exception, if possible
-maybeThrowSpecializedException((NessieReferenceConflictException) ex);
+} catch (NessieConflictException | NessieNotFoundException | 
HttpClientException ex) {
+  NessieUtil.handleExceptionsForCommits(ex, client.refName(), failure);
+} catch (NessieBadRequestException ex) {

Review Comment:
   Re: `NessieBadRequestException`, I agree that we should not try to "handle" 
it. Instead the catalog should be implemented such that 
`NessieBadRequestException` does not happen.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Parquet: don't throw exception on row group filters when reading INT96 column [iceberg]

2023-11-06 Thread via GitHub


manuzhang commented on code in PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#discussion_r1384231086


##
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java:
##
@@ -199,7 +201,7 @@ public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
   int id = ref.fieldId();
 
   Boolean hasNonDictPage = isFallback.get(id);
-  if (hasNonDictPage == null || hasNonDictPage) {
+  if (hasNonDictPage == null || hasNonDictPage || isInt96Column(id)) {

Review Comment:
   Indeed, I've updated the patch. Please review again.
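   For readers following along, the `isInt96Column` helper referenced in the diff might look roughly like this (a sketch; `typesById` is a hypothetical field-id-to-Parquet-type lookup, not necessarily what the patch uses):
   
   ```java
   // Hypothetical helper: treat a column as INT96 if its Parquet primitive type says so.
   private boolean isInt96Column(int id) {
     org.apache.parquet.schema.PrimitiveType colType = typesById.get(id);  // typesById is hypothetical
     return colType != null
         && colType.getPrimitiveTypeName()
             == org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT96;
   }
   ```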



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]

2023-11-06 Thread via GitHub


liurenjie1024 commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1384272688


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
 
-Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a parent struct column is null it implies the leaf column is null).

Review Comment:
   Sorry, I missed the later paragraph. I think the statement is clear enough that a `NULL` value means `is not NULL`. I think it's reasonable to have that definition in an equality check.
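
   To make the matching semantics concrete, here is a small self-contained toy (plain Java, not Iceberg code; it assumes the reading discussed above, where a null value in a delete row selects data rows whose column is null):

   import java.util.Arrays;
   import java.util.List;
   import java.util.Objects;

   public class EqualityDeleteDemo {
     // A data row is deleted if, for every delete column, its value equals the
     // delete row's value, with null equal to null.
     static boolean matches(List<Object> deleteRow, List<Object> dataRow) {
       for (int i = 0; i < deleteRow.size(); i++) {
         if (!Objects.equals(deleteRow.get(i), dataRow.get(i))) {
           return false;
         }
       }
       return true;
     }

     public static void main(String[] args) {
       // Delete columns (id, category); category is an optional column.
       List<Object> deleteRow = Arrays.asList(4, null);
       System.out.println(matches(deleteRow, Arrays.asList(4, null)));  // true: deleted
       System.out.println(matches(deleteRow, Arrays.asList(4, "toy"))); // false: kept
     }
   }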






Re: [I] Spark:CALL [rewrite_manifests] error Manifest is missing [iceberg]

2023-11-06 Thread via GitHub


372242283 commented on issue #4161:
URL: https://github.com/apache/iceberg/issues/4161#issuecomment-1797425869

   Spark: 3.3
   Iceberg: 13.0
   Encountering the same problem.
   
   I also have this problem. I use an Iceberg table with the Hive catalog and run rewrite_data_files -> rewrite_manifests -> expire_snapshots in that order, but a "Manifest is missing..." error occasionally occurs during rewrite_manifests. How do you solve this?
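
   For reference, the maintenance sequence described above, invoked from a Java Spark session. The catalog name `local`, the table `db.tbl`, and the timestamp are placeholders:

   import org.apache.spark.sql.SparkSession;

   public class TableMaintenance {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().getOrCreate();
       spark.sql("CALL local.system.rewrite_data_files(table => 'db.tbl')");
       spark.sql("CALL local.system.rewrite_manifests(table => 'db.tbl')");
       // expire_snapshots deletes manifests that are no longer referenced, so
       // overlapping it with concurrent rewrites is one plausible source of
       // "Manifest is missing" failures.
       spark.sql(
           "CALL local.system.expire_snapshots("
               + "table => 'db.tbl', older_than => TIMESTAMP '2023-01-01 00:00:00')");
     }
   }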





Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


nk1506 commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1384338591


##
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreTest.java:
##
@@ -31,6 +31,10 @@
 import org.junit.jupiter.api.AfterAll;
 import org.junit.jupiter.api.BeforeAll;
 
+/**
+ * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} instead.
+ */
+@Deprecated

Review Comment:
   Thanks @pvary for reviewing this. Yes, we have another change in progress with all the test-related cleanup.
   WIP [patch](https://github.com/nk1506/iceberg/pull/2)
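
   For context, a rough sketch of how a migrated test could register the extension. The `builder()`-style construction is an assumption; the real `HiveMetastoreExtension` API may differ:

   import org.apache.iceberg.hive.HiveMetastoreExtension;
   import org.junit.jupiter.api.Test;
   import org.junit.jupiter.api.extension.RegisterExtension;

   class TestWithMetastoreExtension {
     // Starts and stops a metastore for the test class instead of relying on
     // the inherited static setup in the deprecated HiveMetastoreTest.
     @RegisterExtension
     static final HiveMetastoreExtension METASTORE = HiveMetastoreExtension.builder().build();

     @Test
     void usesMetastore() {
       // access the metastore client/conf through METASTORE here
     }
   }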






Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-06 Thread via GitHub


nk1506 commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1384340506


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -500,6 +511,9 @@ protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) {
 return String.format("%s/%s", databaseData.getLocationUri(), tableIdentifier.name());
   }
 
+} catch (NoSuchObjectException e) {
+  throw new NoSuchNamespaceException(
+  e, "Namespace does not exist: %s", tableIdentifier.namespace().levels()[0]);

Review Comment:
   I think we run Hive CI with both Hive 2 and Hive 3, and I have also validated both versions manually.
   Both [Hive2](https://github.com/apache/hive/blob/release-2.3.9-rc0/metastore/src/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L137) and [Hive3](https://github.com/apache/hive/blob/release-3.1.3-rc3/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L143) have the same error message.






Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]

2023-11-06 Thread via GitHub


lirui-apache commented on issue #5846:
URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1797782929

   Sure, I'll open a PR for it





Re: [I] Ability to the write Metadata JSON [iceberg-python]

2023-11-06 Thread via GitHub


vrd83 commented on issue #22:
URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1797875405

   Guys, is this a prerequisite for altering the [write-ordered-by](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--write-ordered-by) on a table?





Re: [I] Remove properties from `JdbcUtil` [iceberg]

2023-11-06 Thread via GitHub


nastra closed issue #8989: Remove properties from `JdbcUtil`
URL: https://github.com/apache/iceberg/issues/8989





Re: [PR] De-dup props in JdbcUtil [iceberg]

2023-11-06 Thread via GitHub


nastra merged PR #8992:
URL: https://github.com/apache/iceberg/pull/8992





Re: [I] Ability to the write Metadata JSON [iceberg-python]

2023-11-06 Thread via GitHub


HonahX commented on issue #22:
URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1797920066

   Hi @vrd83. It depends on which catalog you want to use to alter the table.
   
   For the RestCatalog, this is not a prerequisite. To enable altering the write order, we can implement a `ReplaceSortOrder` (`UpdateSortOrder`) in a similar manner to how we currently support `UpdateSchema`. You can see how `UpdateSchema` is implemented here for reference:
   
https://github.com/apache/iceberg-python/blob/03fa9f0b6a86fc13d855b24ce92e07b145faa500/pyiceberg/table/__init__.py#L1314-L1319
   
https://github.com/apache/iceberg-python/blob/03fa9f0b6a86fc13d855b24ce92e07b145faa500/pyiceberg/table/__init__.py#L264-L270
   
   For other catalogs, such as Glue, DynamoDB, SQL, and Hive, this will be a prerequisite. We need the `ReplaceSortOrder` support and the three bullet points listed by @Fokko above.
   
   @Fokko, please correct me if I've missed anything about the RestCatalog part.
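
   For comparison, the Iceberg Java API already exposes this as `Table.replaceSortOrder()`; a pyiceberg `ReplaceSortOrder` would mirror something like the following (catalog and table names are placeholders):

   import org.apache.iceberg.Table;
   import org.apache.iceberg.catalog.Catalog;
   import org.apache.iceberg.catalog.TableIdentifier;

   class ReplaceSortOrderExample {
     // "catalog" is any configured Iceberg catalog implementation.
     static void setWriteOrder(Catalog catalog) {
       Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
       // Replaces the table's sort order in a single atomic commit.
       table.replaceSortOrder().asc("category").desc("ts").commit();
     }
   }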





Re: [I] Ability to the write Metadata JSON [iceberg-python]

2023-11-06 Thread via GitHub


HonahX commented on issue #22:
URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1797949558

   @Fokko Thanks for the explanation! 
   
   > Ability to write the JSON to the object store (that was the intent of this PR).
   
   I think we already support this: https://github.com/apache/iceberg-python/blob/8e8d39dacde067773d6d840b9bf65070399957a9/pyiceberg/catalog/__init__.py#L571-L572
   Could you please elaborate on this a bit? Are there additional features we are looking to implement beyond the `_write_metadata` method?
   
   > Have logic to update the metadata dictionary as you pointed out above. I think we can do this per operation (update schema, update partition-spec, update sort-order, etc) to keep it small and we can get it in quickly.
   
   I plan to start with `update_schema`, `set_snapshot_ref`, and `add_snapshot`, given that `update_schema` is already supported and the other two operations are pivotal for write support.
   
   I will try to make a draft PR soon for further discussion.

