Re: [PR] Build/Release: Upgrade to RAT 0.16.1 [iceberg]
nastra merged PR #9579: URL: https://github.com/apache/iceberg/pull/9579
Re: [PR] Build: Bump org.assertj:assertj-core from 3.25.1 to 3.25.2 [iceberg]
nastra merged PR #9576: URL: https://github.com/apache/iceberg/pull/9576
Re: [PR] Build: Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.3 to 5.3.1 [iceberg]
nastra merged PR #9572: URL: https://github.com/apache/iceberg/pull/9572
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469223182

## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ##

@@ -123,4 +132,17 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
     collectTempViews(child)
   }
+
+  private object IsIcebergView {

Review Comment: I like those new names better. I've updated the existing one to `ResolvedIdent` as otherwise it clashes with the existing `ResolvedIdentifier`. Unfortunately both `unapply` methods can't be combined
Re: [PR] Small getting started guide on writes [iceberg-python]
Fokko commented on code in PR #311: URL: https://github.com/apache/iceberg-python/pull/311#discussion_r1469257727 ## mkdocs/docs/index.md: ## @@ -38,36 +38,129 @@ You can install the latest release version from pypi: pip install "pyiceberg[s3fs,hive]" ``` -Install it directly for GitHub (not recommended), but sometimes handy: +You can mix and match optional dependencies depending on your needs: + +| Key | Description: | +| | | +| hive | Support for the Hive metastore | +| glue | Support for AWS Glue | +| dynamodb | Support for AWS DynamoDB | +| sql-postgres | Support for SQL Catalog backed by Postgresql | +| sql-sqlite | Support for SQL Catalog backed by SQLite | +| pyarrow | PyArrow as a FileIO implementation to interact with the object store | +| pandas | Installs both PyArrow and Pandas | +| duckdb | Installs both PyArrow and DuckDB | +| ray | Installs PyArrow, Pandas, and Ray | +| s3fs | S3FS as a FileIO implementation to interact with the object store| +| adlfs| ADLFS as a FileIO implementation to interact with the object store | +| snappy | Support for snappy Avro compression | +| gcs | GCS as the FileIO implementation to interact with the object store | + +You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store. + +## Connecting to a catalog + +Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Icebergs' own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Checkout the [configuration](configuration.md) page to find all the configuration details. +## Write a PyArrow dataframe + +Let's take the Taxi dataset, and write this to an Iceberg table. + +First download one month of data: + +```shell +curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet ``` -pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"; + +Load it into your PyArrow dataframe: + +```python +import pyarrow.parquet as pq + +df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet") ``` -Or clone the repository for local development: +Create a new Iceberg table: -```sh -git clone https://github.com/apache/iceberg-python.git -cd iceberg-python -pip3 install -e ".[s3fs,hive]" +```python +from pyiceberg.catalog import load_catalog + +catalog = load_catalog("default") + +table = catalog.create_table( +"default.taxi_dataset", +schema=df.schema, # Blocked by https://github.com/apache/iceberg-python/pull/305 +) ``` -You can mix and match optional dependencies depending on your needs: +Append the dataframe to the table: + +```python +table.append(df) +len(table.scan().to_arrow()) +``` + +3066766 rows have been written to the table. + +Now generate a tip-per-mile feature to train the model on: + +```python +import pyarrow.compute as pc + +df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"])) +``` + +Evolve the schema of the table with the new column: + +```python +from pyiceberg.catalog import Catalog + +with table.update_schema() as update_schema: +# Blocked by https://github.com/apache/iceberg-python/pull/305 +update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema)) Review Comment: I fully agree. I should have been more explicit in the comment above. 
Once #305 is in, we can update the `union_by_name` to also accept `pa.Schema`. WDYT?
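For readers following the thread, a rough sketch of what the simplified call could look like once #305 lands. This is hypothetical usage, assuming `union_by_name` gains support for `pa.Schema`; the catalog name and table identifier are taken from the getting-started example above, and today the private `Catalog._convert_schema_if_needed` helper shown in the guide is still required:

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("default.taxi_dataset")

# A PyArrow schema describing the new column added to the dataframe.
new_columns = pa.schema([pa.field("tip_per_mile", pa.float64())])

with table.update_schema() as update_schema:
    # Proposed: pass the pa.Schema directly and let pyiceberg convert it
    # internally, instead of calling Catalog._convert_schema_if_needed here.
    update_schema.union_by_name(new_columns)
```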
Re: [I] Reading large data through Glue Catalog is SLOW [iceberg]
anechii closed issue #9559: Reading large data through Glue Catalog is SLOW URL: https://github.com/apache/iceberg/issues/9559
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
Fokko commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469288671 ## pyiceberg/io/pyarrow.py: ## @@ -761,6 +761,32 @@ def primitive(self, primitive: pa.DataType) -> T: """Visit a primitive type.""" +class PreOrderPyArrowSchemaVisitor(Generic[T], ABC): Review Comment: ```suggestion class _PreOrderPyArrowSchemaVisitor(Generic[T], ABC): ``` ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None: -self.reserved_ids = {} +self.old_id_to_new_id = {} counter = itertools.count(1) self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter) -def _get_and_increment(self) -> int: -return self.next_id_func() +def _get_and_increment(self, current_id: int) -> int: +new_id = self.next_id_func() +self.old_id_to_new_id[current_id] = new_id +return new_id def schema(self, schema: Schema, struct_result: Callable[[], StructType]) -> Schema: -# First we keep the original identifier_field_ids here, we remap afterwards -fields = struct_result().fields -return Schema(*fields, identifier_field_ids=[self.reserved_ids[field_id] for field_id in schema.identifier_field_ids]) +return Schema( +*struct_result().fields, +identifier_field_ids=[self.old_id_to_new_id[field_id] for field_id in schema.identifier_field_ids], +) def struct(self, struct: StructType, field_results: List[Callable[[], IcebergType]]) -> StructType: -# assign IDs for this struct's fields first -self.reserved_ids.update({field.field_id: self._get_and_increment() for field in struct.fields}) -return StructType(*[field() for field in field_results]) +new_ids = [self._get_and_increment(field.field_id) for field in struct.fields] +new_fields = [] +for field_id, field, field_type in zip(new_ids, struct.fields, field_results): +new_fields.append( +NestedField( +field_id=field_id, +name=field.name, +field_type=field_type(), +required=field.required, +doc=field.doc, +) Review Comment: I think it is cleaner to keep this part in the `field` method, since it creates a field, and the `field()` now doesn't return a field, but a type. ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +932,21 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithNoIds(_ConvertToIceberg): Review Comment: Style suggestion, feel free to ignore: ```suggestion class _ConvertToIcebergWithoutIDs(_ConvertToIceberg): ``` ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: I also noticed that the Python implementation differs from the Java side. I don't know what the historical reason for this is. I couldn't find anything on the original PR about why this was done that way https://github.com/apache/iceberg/pull/5627. I'm okay with aligning this with the Java implementation. > Overall, this approach sounds reasonable to me if we can find an easy way to refactor the _SetFreshIds. @Fokko, I'd appreciate your perspective on refactoring _SetFreshIds. Do you see any issues with this approach? 
That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting `-1` field IDs to the outside. What I like about the current implementation is that the visitor can be used on its own. Converting a field with all `-1` IDs doesn't provide much value on its own. I would love to get this in with the 0.6.0 release to simplify the creation of tables. ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I think at some point we just want to align this with Java. But let's do that in a separate PR: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469223182

## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ##

@@ -123,4 +132,17 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
     collectTempViews(child)
   }
+
+  private object IsIcebergView {

Review Comment: I like those new names better. I've updated the existing one to `ResolvedIdent` as otherwise it clashes with the existing `ResolvedIdentifier`. Unfortunately both `unapply` methods can't be combined into `ResolvedView`
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on PR #9510: URL: https://github.com/apache/iceberg/pull/9510#issuecomment-1914349245 @rdblue I have extracted setting/unsetting view properties into https://github.com/apache/iceberg/pull/9582
Re: [I] gc.enabled property is set to false by default for Apache Iceberg table created in Nessie Catalog [iceberg]
nastra closed issue #9562: gc.enabled property is set to false by default for Apache Iceberg table created in Nessie Catalog URL: https://github.com/apache/iceberg/issues/9562
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on PR #9565: URL: https://github.com/apache/iceberg/pull/9565#issuecomment-1914388103 @ilyasahsan123 we typically apply such changes to the latest version first (1.18) and then backport them in a separate PR. Could you apply these changes to Flink 1.18 first, please?
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
ilyasahsan123 commented on PR #9565: URL: https://github.com/apache/iceberg/pull/9565#issuecomment-1914437348 Sure, I will do it. Thanks
Re: [PR] Spark: Add support for describing/showing views [iceberg]
nastra commented on code in PR #9513: URL: https://github.com/apache/iceberg/pull/9513#discussion_r1469444718 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ExtendedDataSourceV2Strategy.scala: ## @@ -123,6 +132,47 @@ case class ExtendedDataSourceV2Strategy(spark: SparkSession) extends Strategy wi allowExisting = allowExisting, replace = replace) :: Nil +case AlterViewAs(ResolvedV2View(catalog, ident, _), queryText, query) => + CreateV2ViewExec( +catalog = catalog, +ident = ident, +queryText = queryText, +columnAliases = Seq.empty, +columnComments = Seq.empty, +queryColumnNames = query.schema.fieldNames, +viewSchema = query.schema, +comment = None, +properties = Map.empty, +allowExisting = false, +replace = true) :: Nil + +case SetViewProperties(ResolvedV2View(catalog, ident, _), properties) => + val changes = properties.map { +case (property, value) => ViewChange.setProperty(property, value) + }.toSeq + AlterV2ViewExec(catalog, ident, changes) :: Nil + +case UnsetViewProperties(ResolvedV2View(catalog, ident, view), propertyKeys, ifExists) => Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.3: Add RemoveDanglingDeletes action [iceberg]
zinking commented on code in PR #6581: URL: https://github.com/apache/iceberg/pull/6581#discussion_r1469455133

## api/src/main/java/org/apache/iceberg/DeleteFiles.java: ##

@@ -55,6 +55,17 @@ default DeleteFiles deleteFile(DataFile file) {
     return this;
   }
+  /**
+   * Delete a file tracked by a {@link DeleteFile} from the underlying table.
+   *
+   * @param file a DeleteFile to remove from the table
+   * @return this for method chaining
+   */
+  default DeleteFiles deleteFile(DeleteFile file) {
+    deleteFile(file.path());

Review Comment: This will not work and needs to be changed to use the delete-file-specific API. I'm surprised that the unit tests didn't catch it.
Re: [PR] Spark: Add support for describing/showing views [iceberg]
nastra commented on code in PR #9513: URL: https://github.com/apache/iceberg/pull/9513#discussion_r1469477447 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ## @@ -60,17 +65,27 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi properties = properties, allowExisting = allowExisting, replace = replace) + +case a@AlterViewAs(ResolvedV2View(_, ident, _), _, query) => + val q = CTESubstitution.apply(query) + verifyTemporaryObjectsDontExist(ident, q) + a.copy(query = q) + +case u@UnresolvedView(IsIcebergView(resolved), _, _, _) => + loadView(resolved.catalog, resolved.identifier) +.map(v => ResolvedV2View(resolved.catalog.asViewCatalog, resolved.identifier, v)) +.getOrElse(u) + +case ShowViews(UnresolvedNamespace(ns), pattern, output) + if isViewCatalog(spark.sessionState.catalogManager.currentCatalog) => + ShowIcebergViews(UnresolvedNamespace(ns), pattern, output) Review Comment: I've updated this to use `CatalogAndNamespace` for this particular case. For the case where no namespace is given, we need to match `ShowViews(UnresolvedNamespace(Seq()), pattern, output)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Add support for describing/showing views [iceberg]
nastra commented on code in PR #9513: URL: https://github.com/apache/iceberg/pull/9513#discussion_r1469484505 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateV2ViewExec.scala: ## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.util.escapeSingleQuotedString +import org.apache.spark.sql.connector.catalog.View +import org.apache.spark.sql.connector.catalog.ViewCatalog +import org.apache.spark.sql.execution.LeafExecNode +import scala.collection.JavaConverters._ + +case class ShowCreateV2ViewExec(output: Seq[Attribute], view: View) + extends V2CommandExec with LeafExecNode { + + override protected def run(): Seq[InternalRow] = { +val builder = new StringBuilder +builder ++= s"CREATE VIEW ${view.name} " +showColumns(view, builder) +showComment(view, builder) +showProperties(view, builder) +builder ++= s"AS\n${view.query}\n" + +Seq(toCatalystRow(builder.toString)) + } + + private def showColumns(view: View, builder: StringBuilder): Unit = { +val columns = view.schema().fields Review Comment: I ran into this case when I was implementing this so I compared this with how `SHOW CREATE TABLE ` does it. [Here](https://github.com/apache/spark/blob/9088249f02543c549e0b330812c77edcdbbeb795/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala#L64-L67) it can be seen that it just takes the latest schema of the table and uses it for the output. So should we be aligning `SHOW CREATE VIEW` with `SHOW CREATE TABLE` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
ilyasahsan123 commented on PR #9565: URL: https://github.com/apache/iceberg/pull/9565#issuecomment-1914620937 Hi @nastra, I've addressed your suggestion. Could you please take another look? Thanks
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469014546

## tests/catalog/test_base.py: ##

@@ -330,6 +333,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None:
     assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table
+@pytest.mark.parametrize(
+    "schema_fixture,expected_fixture",
+    [
+        ("pyarrow_schema_simple_without_ids", "iceberg_schema_simple"),
+        ("iceberg_schema_simple", "iceberg_schema_simple"),
+        ("iceberg_schema_nested", "iceberg_schema_nested"),
+        ("pyarrow_schema_nested_without_ids", "iceberg_schema_nested"),
+    ],

Review Comment: Thank you for the suggestion! :)
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469628869

## flink/v1.18/build.gradle: ##

@@ -22,6 +22,10 @@ String scalaVersion = System.getProperty("scalaVersion") != null ? System.getPro
 project(":iceberg-flink:iceberg-flink-${flinkMajorVersion}") {
+  test {

Review Comment: this is already set in L127, so it's not needed here
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469634197

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/TestRowDataWrapper.java: ##

@@ -49,12 +49,12 @@ public void testTime() {
       return;
     }
-    Assertions.assertThat(actual).isNotNull();
-    Assertions.assertThat(expected).isNotNull();
+    assertThat(actual).isNotNull();
+    assertThat(expected).isNotNull();
     int expectedMilliseconds = (int) ((long) expected / 1000_000);
     int actualMilliseconds = (int) ((long) actual / 1000_000);
-    Assert.assertEquals(message, expectedMilliseconds, actualMilliseconds);
+    assertThat(expectedMilliseconds).as(message).isEqualTo(actualMilliseconds);

Review Comment: the actual/expected is the wrong way around. It should be `assertThat(actualMilliseconds).isEqualTo(expectedMilliseconds)`
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469635766

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/TestRowDataWrapper.java: ##

@@ -75,8 +75,8 @@ protected void generateAndValidate(Schema schema, RecordWrapperTest.AssertMethod
     StructLikeWrapper actualWrapper = StructLikeWrapper.forType(schema.asStruct());
     StructLikeWrapper expectedWrapper = StructLikeWrapper.forType(schema.asStruct());
     for (int i = 0; i < numRecords; i++) {
-      Assert.assertTrue("Should have more records", actual.hasNext());
-      Assert.assertTrue("Should have more RowData", expected.hasNext());
+      assertThat(actual.hasNext()).as("Should have more records").isTrue();

Review Comment:
```suggestion
      assertThat(actual).as("Should have more records").hasNext();
```
Same for the other places that use a similar pattern. We want to avoid using isTrue/isFalse as much as possible, because we want to get rich context from AssertJ whenever an assertion fails.
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469638343

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/data/TestStructRowData.java: ##

@@ -22,7 +22,7 @@ import org.apache.iceberg.flink.DataGenerator;
 import org.apache.iceberg.flink.DataGenerators;
 import org.apache.iceberg.flink.TestHelpers;
-import org.junit.Test;
+import org.junit.jupiter.api.Test;
 public class TestStructRowData {

Review Comment: why not also include the other files from the `data` package in this PR?
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469640535

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/sink/TestAvroGenericRecordToRowDataMapper.java: ##

@@ -18,10 +18,11 @@ */
 package org.apache.iceberg.flink.sink;
+import static org.junit.jupiter.api.Assertions.assertEquals;

Review Comment: wrong import. We should be using `AssertJ` assertions instead of the JUnit ones
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469642860

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/sink/TestAvroGenericRecordToRowDataMapper.java: ##

@@ -18,10 +18,11 @@ */
 package org.apache.iceberg.flink.sink;

Review Comment: I would have expected other test files from the `sink` package to be included here as well; just picking 2 files seems quite random to me. I'd suggest focusing on one or more packages per PR (depending on how big the changes are) instead of randomly picking files across the codebase.
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469643574 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: That makes sense - I'm having an aha moment here, and just wanted to summarize my thoughts... I think we could include the work to enable passing the baseSchema as an input argument to the visitor to [REPLACE TABLE](https://github.com/apache/iceberg-python/issues/281#issuecomment-1905187291) support since that's the function that requires fetching the original IDs from the baseSchema. CREATE TABLE, which is the only operation we've supported so far doesn't need to compare any IDs against a baseSchema. In the java code, [idFor](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/AssignFreshIds.java#L51) function uses the existing ID from the baseSchema by searching it by name if it exists. If it doesn't exist, it assigns the new ID from the counter. Another difference I see with AssignFreshIds in the java code is that it doesn't handle the reassignment of the identifier_field_ids within the visitor. Instead, it returns the schema object and searches the [new identifier field IDs](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L262) by name. Is there a reason we decided to step away from this approach in the Python implementation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
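The `idFor` behaviour described in that comment can be summarized with a small, self-contained sketch. This is an illustration of the Java semantics under discussion, not current pyiceberg code; `make_id_for`, `next_id_start`, and the sample field names are hypothetical:

```python
import itertools
from typing import Dict, Optional


def make_id_for(base_ids_by_name: Dict[str, int], next_id_start: int):
    """Mirror the described idFor(): reuse the base schema's ID when the full
    field name is known, otherwise hand out the next ID from the counter."""
    counter = itertools.count(next_id_start)

    def id_for(full_name: Optional[str]) -> int:
        if full_name is not None and full_name in base_ids_by_name:
            return base_ids_by_name[full_name]
        return next(counter)

    return id_for


# REPLACE TABLE-style usage: existing names keep their old IDs, new names get fresh ones.
id_for = make_id_for({"id": 1, "payload.value": 3}, next_id_start=4)
print(id_for("id"))             # 1  (reused from the base schema)
print(id_for("payload.other"))  # 4  (fresh, from the counter)

# CREATE TABLE-style usage: no base schema, so every field gets a fresh ID.
fresh = make_id_for({}, next_id_start=1)
print(fresh("id"), fresh("payload"))  # 1 2
```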
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469643916

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/sink/TestRowDataPartitionKey.java: ##

@@ -91,10 +92,10 @@ public void testNullPartitionValue() {
     for (RowData row : rows) {
       PartitionKey partitionKey = new PartitionKey(spec, schema);
       partitionKey.partition(rowWrapper.wrap(row));
-      Assert.assertEquals(partitionKey.size(), 1);
+      assertThat(partitionKey.size()).isEqualTo(1);
       String expectedStr = row.isNullAt(1) ? null : row.getString(1).toString();
-      Assert.assertEquals(expectedStr, partitionKey.get(0, String.class));
+      assertThat(expectedStr).isEqualTo(partitionKey.get(0, String.class));

Review Comment: the ordering of actual/expected needs to be fixed here
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469646130

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/reader/TestArrayBatchRecords.java: ##

@@ -50,19 +51,18 @@ private void testArray(
         fileOffset,
         startingRecordOffset);
-    Assert.assertEquals(splitId, recordsWithSplitIds.nextSplit());
+    assertThat(splitId).isEqualTo(recordsWithSplitIds.nextSplit());
     for (int i = 0; i < numberOfRecords; i++) {
       RecordAndPosition recAndPos = recordsWithSplitIds.nextRecordFromSplit();
-      Assert.assertEquals(elements[i], recAndPos.record());
-      Assert.assertEquals(fileOffset, recAndPos.fileOffset());
-      // recordOffset points to the position after this one
-      Assert.assertEquals(startingRecordOffset + i + 1, recAndPos.recordOffset());
+      assertThat(elements[i]).isEqualTo(recAndPos.record());
+      assertThat(fileOffset).isEqualTo(recAndPos.fileOffset());

Review Comment: I think actual/expected are in the wrong order here as well
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469674294 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None: -self.reserved_ids = {} +self.old_id_to_new_id = {} counter = itertools.count(1) self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter) -def _get_and_increment(self) -> int: -return self.next_id_func() +def _get_and_increment(self, current_id: int) -> int: +new_id = self.next_id_func() +self.old_id_to_new_id[current_id] = new_id +return new_id def schema(self, schema: Schema, struct_result: Callable[[], StructType]) -> Schema: -# First we keep the original identifier_field_ids here, we remap afterwards -fields = struct_result().fields -return Schema(*fields, identifier_field_ids=[self.reserved_ids[field_id] for field_id in schema.identifier_field_ids]) +return Schema( +*struct_result().fields, +identifier_field_ids=[self.old_id_to_new_id[field_id] for field_id in schema.identifier_field_ids], +) def struct(self, struct: StructType, field_results: List[Callable[[], IcebergType]]) -> StructType: -# assign IDs for this struct's fields first -self.reserved_ids.update({field.field_id: self._get_and_increment() for field in struct.fields}) -return StructType(*[field() for field in field_results]) +new_ids = [self._get_and_increment(field.field_id) for field in struct.fields] +new_fields = [] +for field_id, field, field_type in zip(new_ids, struct.fields, field_results): +new_fields.append( +NestedField( +field_id=field_id, +name=field.name, +field_type=field_type(), +required=field.required, +doc=field.doc, +) Review Comment: Yes, I realized that too... I think the refactored behavior actually [mirrors the Java code](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/AssignFreshIds.java#L102) more closely now in how it takes the Type, and then creates the NestedField with the result of the field level Callable within the struct function call in order to assign the IDs first to the fields, and then to its children. I think preserving this approach allows us to avoid introducing the new constraint of requiring the field IDs to be unique, and limits that restriction to just the [correctness in the task of refreshing the identifier_field_ids](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L262), which requires referring to the field Name by index as well as by name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469684751 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckViews.scala: ## @@ -36,6 +38,9 @@ object CheckViews extends (LogicalPlan => Unit) { verifyColumnCount(ident, columnAliases, query) SchemaUtils.checkColumnNameDuplication(query.schema.fieldNames, SQLConf.get.resolver) +case AlterViewAs(ResolvedV2View(_, _, _), _, query) => + SchemaUtils.checkColumnNameDuplication(query.schema.fieldNames, SQLConf.get.resolver) Review Comment: I agree that this is a weird case and I also had a different expectation when initially running into this. That's why I added the `alterViewWithUpdatedQueryColumns()` test to make this behavior explicit. I also looking into how V1 views behave here and they do the same thing (aka losing the column comments from the original CREATE): ``` spark-sql (default)> create view iceberg1.v10 (x COMMENT 'comment 1', y COMMENT 'comment 2') AS SELECT count(id), zip from iceberg1.foo group by zip; spark-sql (default)> show create table iceberg1.v10; CREATE VIEW iceberg1.v10 ( x COMMENT 'comment 1', y COMMENT 'comment 2') TBLPROPERTIES ( 'transient_lastDdlTime' = '1706538575') AS SELECT count(id) AS cnt, zip from iceberg1.foo group by zip Time taken: 0.055 seconds, Fetched 1 row(s) spark-sql (default)> describe extended iceberg1.v10; xbigint comment 1 yint comment 2 # Detailed Table Information Catalog spark_catalog Database iceberg1 Tablev10 Ownernastra Created Time Mon Jan 29 15:29:35 CET 2024 Last Access UNKNOWN Created By Spark 3.5.0 Type VIEW View TextSELECT count(id) AS cnt, zip from iceberg1.foo group by zip View Original Text SELECT count(id) AS cnt, zip from iceberg1.foo group by zip View Catalog and Namespace spark_catalog.default View Query Output Columns[cnt, zip] Table Properties [transient_lastDdlTime=1706538575] Serde Libraryorg.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [serialization.format=1] Time taken: 0.063 seconds, Fetched 21 row(s) ``` ``` spark-sql (default)> alter view iceberg1.v10 AS SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip; Time taken: 0.213 seconds spark-sql (default)> show create table iceberg1.v10; CREATE VIEW iceberg1.v10 ( cnt, zip) TBLPROPERTIES ( 'transient_lastDdlTime' = '1706538634') AS SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip Time taken: 0.029 seconds, Fetched 1 row(s) spark-sql (default)> describe extended iceberg1.v10; cnt bigint zip int # Detailed Table Information Catalog spark_catalog Database iceberg1 Tablev10 Ownernastra Created Time Mon Jan 29 15:29:35 CET 2024 Last Access UNKNOWN Created By Spark 3.5.0 Type VIEW View TextSELECT count(zip) AS cnt, zip from iceberg1.foo group by zip View Original Text SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip View Catalog and Namespace spark_catalog.default View Query Output Columns[cnt, zip]
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on PR #305: URL: https://github.com/apache/iceberg-python/pull/305#issuecomment-1914837022 Hi @Fokko - for some reason I don't see an option to respond to this specific Review comment, so I'm just replying here instead: > I would love to get this in with the 0.6.0 release to simplify the creation of tables. Sounds good 👍 I'll try my best to bring this review to a consensus in the next few days. We already have a working version, so we just need to agree on the final details 😄 >That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting -1 field IDs to the outside. > > What I like about the current implementation is that the visitor can be used on its own. Converting a field with all -1 IDs doesn't provide much value on its own. I'm 🆗 with both approach, but I have a preference for the '-1' approach... and here's my current thought. Both approaches will work, and in both approaches, we are introducing an intermediate state of the Iceberg Schema that should never be exposed. But what I like about this approach is that we are creating an intermediate Iceberg Schema that is very visibly wrong, that we could use to check the validity of the schema before we allow the erroneous intermediate state to materialize a lasting result. An example could be to introduce a new sanity check before writing the individual files, or before making a schema update, that checks that no two IDs are the same, or that the IDs are greater than 0. Hence, I think the fact that the intermediate Iceberg Schema is "seemingly" correct actually feels more like a downside to me... I'm curious to hear if that argument resonates with you or the rest of the group. Hopefully it will be the last discussion item before we settle on the final approach 🎉🎈 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
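The sanity check suggested in that comment could be as simple as the following sketch, run before writing files or applying a schema update. It is a hypothetical helper, not an existing pyiceberg API, and the field names used in the example are made up:

```python
from typing import Iterable, Tuple


def check_field_ids(fields: Iterable[Tuple[str, int]]) -> None:
    """Reject schemas that still carry placeholder (-1) or duplicate field IDs."""
    seen = {}
    for name, field_id in fields:
        if field_id < 1:
            raise ValueError(f"Field {name!r} still has a placeholder ID: {field_id}")
        if field_id in seen:
            raise ValueError(f"Fields {seen[field_id]!r} and {name!r} share field ID {field_id}")
        seen[field_id] = name


check_field_ids([("vendor_id", 1), ("tip_amount", 2)])      # passes silently
# check_field_ids([("vendor_id", -1), ("tip_amount", -1)])  # would raise on the placeholder ID
```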
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469706761 ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +932,21 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithNoIds(_ConvertToIceberg): Review Comment: Thank you for the suggestion :)
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469710203 ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: > I would love to get this in with the 0.6.0 release to simplify the creation of tables. Sounds good 👍 I'll try my best to bring this review to a consensus in the next few days. We already have a working version, so we just need to agree on the final details 😄 >That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting -1 field IDs to the outside. > > What I like about the current implementation is that the visitor can be used on its own. Converting a field with all -1 IDs doesn't provide much value on its own. I'm 🆗 with both approach, but I have a preference for the '-1' approach... and here's my current thought. Both approaches will work, and in both approaches, we are introducing an intermediate state of the Iceberg Schema that should never be exposed. But what I like about this approach is that we are creating an intermediate Iceberg Schema that is very visibly wrong, that we could use to check the validity of the schema before we allow the erroneous intermediate state to materialize a lasting result. An example could be to introduce a new sanity check before writing the individual files, or before making a schema update, that checks that no two IDs are the same, or that the IDs are greater than 0. Hence, I think the fact that the intermediate Iceberg Schema is "seemingly" correct actually feels more like a downside to me... I'm curious to hear if that argument resonates with you or the rest of the group. Hopefully it will be the last discussion item before we settle on the final approach 🎉🎈 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469711334 ## pyiceberg/io/pyarrow.py: ## @@ -761,6 +761,32 @@ def primitive(self, primitive: pa.DataType) -> T: """Visit a primitive type.""" +class PreOrderPyArrowSchemaVisitor(Generic[T], ABC): Review Comment: Good catch - I missed this one. I'll remove this visitor as it's no longer in use.
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469766873

## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ##

@@ -60,17 +63,23 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
       properties = properties,
       allowExisting = allowExisting,
       replace = replace)
+
+    case a@AlterViewAs(ResolvedV2View(_, ident, _), _, query) =>

Review Comment: if we want to go this route, then we'd also need to add the same handling for `SetViewProperties(UnresolvedView(...))` / `UnsetViewProperties(UnresolvedView(...))`, hence it seemed better to me to just handle `case u@UnresolvedView(ResolvedView(resolved), _, _, _)` separately. I'm fine either way, so let me know if you'd still like me to switch to the approach you proposed.
[PR] docs: Add community page [iceberg-python]
Fokko opened a new pull request, #315: URL: https://github.com/apache/iceberg-python/pull/315 Analogous to the general Iceberg page. I also think it is good to revive the PyIceberg sync.
[PR] docs: Remove feature support page [iceberg-python]
Fokko opened a new pull request, #316: URL: https://github.com/apache/iceberg-python/pull/316 With the 0.6.0 release we'll check the last box, and we have feature parity with Java according to this table. I would prefer creating tickets on Github to point out the gaps rather than this page.
Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]
gjacoby126 commented on PR #9452: URL: https://github.com/apache/iceberg/pull/9452#issuecomment-1915203994 @nastra @pvary, just checking back in. I was wondering if there were any more changes you'd like to see to this patch, or if it's ready for commit?
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1469960212 ## api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java: ## @@ -564,14 +569,26 @@ private static String sanitizeDate(int days, int today) { return "(date)"; } - private static String sanitizeTimestamp(long micros, long now) { -String isPast = now > micros ? "ago" : "from-now"; -long diff = Math.abs(now - micros); -if (diff < FIVE_MINUTES_IN_MICROS) { + private static String sanitizeTimestamp(ChronoUnit unit, long timeUnits, long nowMillis) { +long timeMillis; +switch (unit) { + case MICROS: +timeMillis = DateTimeUtil.microsToMillis(timeUnits); +break; + case NANOS: +timeMillis = DateTimeUtil.nanosToMillis(timeUnits); Review Comment: I don't see how. See line 521 of `sanitize` for example. ## api/src/main/java/org/apache/iceberg/expressions/Literals.java: ## @@ -501,18 +531,22 @@ public Literal to(Type type) { return (Literal) new TimeLiteral(timeMicros); case TIMESTAMP: - if (((Types.TimestampType) type).shouldAdjustToUTC()) { -long timestampMicros = -ChronoUnit.MICROS.between( -EPOCH, OffsetDateTime.parse(value(), DateTimeFormatter.ISO_DATE_TIME)); -return (Literal) new TimestampLiteral(timestampMicros); + TimestampType tsType = (TimestampType) type; + if (tsType.shouldAdjustToUTC()) { +long timestampUnits = +tsType +.unit() +.between(EPOCH, OffsetDateTime.parse(value(), DateTimeFormatter.ISO_DATE_TIME)); Review Comment: Maybe. Do you want that done here, or in a follow-up pull request? ## api/src/main/java/org/apache/iceberg/expressions/Literals.java: ## @@ -426,16 +428,40 @@ protected Type.TypeID typeId() { } static class TimestampLiteral extends ComparableLiteral { -TimestampLiteral(Long value) { +private final ChronoUnit unit; + +TimestampLiteral(ChronoUnit unit, Long value) { super(value); + this.unit = unit; } @Override @SuppressWarnings("unchecked") public Literal to(Type type) { switch (type.typeId()) { case TIMESTAMP: - return (Literal) this; + ChronoUnit toUnit = ((TimestampType) type).unit(); + switch (unit) { +case MICROS: + switch (toUnit) { +case MICROS: + return (Literal) this; +case NANOS: + return (Literal) + new TimestampLiteral(unit, DateTimeUtil.microsToNanos(value())); + } + break; +case NANOS: + switch (toUnit) { +case MICROS: + return (Literal) + new TimestampLiteral(unit, DateTimeUtil.nanosToMicros(value())); +case NANOS: + return (Literal) this; + } + break; + } + break; Review Comment: I'll add that. ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -46,8 +47,10 @@ private Types() {} .put(DoubleType.get().toString(), DoubleType.get()) .put(DateType.get().toString(), DateType.get()) .put(TimeType.get().toString(), TimeType.get()) - .put(TimestampType.withZone().toString(), TimestampType.withZone()) - .put(TimestampType.withoutZone().toString(), TimestampType.withoutZone()) + .put(TimestampType.microsWithZone().toString(), TimestampType.microsWithZone()) + .put(TimestampType.microsWithoutZone().toString(), TimestampType.microsWithoutZone()) Review Comment: Are you just suggesting we leave lines 49 and 50 alone, i.e. using the methods we want to deprecate? Functionally there's no difference, but I'd prefer not to use the deprecated forms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1469960791 ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -205,38 +208,74 @@ public String toString() { } public static class TimestampType extends PrimitiveType { -private static final TimestampType INSTANCE_WITH_ZONE = new TimestampType(true); -private static final TimestampType INSTANCE_WITHOUT_ZONE = new TimestampType(false); +private static final TimestampType INSTANCE_MICROS_WITH_ZONE = +new TimestampType(true, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_MICROS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_NANOS_WITH_ZONE = +new TimestampType(true, ChronoUnit.NANOS); +private static final TimestampType INSTANCE_NANOS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.NANOS); + +/** @deprecated use {@link #microsWithZone()} instead. */ +@Deprecated Review Comment: Happy to add that. Are we certain it will be in "2.0" or should I say something like "likely removed in 2.0"? ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -205,38 +208,74 @@ public String toString() { } public static class TimestampType extends PrimitiveType { -private static final TimestampType INSTANCE_WITH_ZONE = new TimestampType(true); -private static final TimestampType INSTANCE_WITHOUT_ZONE = new TimestampType(false); +private static final TimestampType INSTANCE_MICROS_WITH_ZONE = +new TimestampType(true, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_MICROS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_NANOS_WITH_ZONE = +new TimestampType(true, ChronoUnit.NANOS); +private static final TimestampType INSTANCE_NANOS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.NANOS); + +/** @deprecated use {@link #microsWithZone()} instead. */ +@Deprecated public static TimestampType withZone() { - return INSTANCE_WITH_ZONE; + return INSTANCE_MICROS_WITH_ZONE; } +/** @deprecated use {@link #microsWithoutZone()} instead. */ +@Deprecated public static TimestampType withoutZone() { - return INSTANCE_WITHOUT_ZONE; + return INSTANCE_MICROS_WITHOUT_ZONE; +} + +public static TimestampType microsWithZone() { + return INSTANCE_MICROS_WITH_ZONE; +} + +public static TimestampType microsWithoutZone() { + return INSTANCE_MICROS_WITHOUT_ZONE; +} + +public static TimestampType nanosWithZone() { + return INSTANCE_NANOS_WITH_ZONE; +} + +public static TimestampType nanosWithoutZone() { + return INSTANCE_NANOS_WITHOUT_ZONE; } private final boolean adjustToUTC; +private final ChronoUnit unit; -private TimestampType(boolean adjustToUTC) { +private TimestampType(boolean adjustToUTC, ChronoUnit unit) { this.adjustToUTC = adjustToUTC; + this.unit = unit; } public boolean shouldAdjustToUTC() { return adjustToUTC; } +public ChronoUnit unit() { Review Comment: We use some ChronoUnit functionality, e.g. the `between` method. I can introduce our own enum wrapping a ChonoUnit and delegating that method, if you prefer. 
## api/src/test/java/org/apache/iceberg/TestPartitionPaths.java: ## @@ -42,8 +42,8 @@ public void testPartitionPath() { Transform bucket = Transforms.bucket(10); Literal ts = - Literal.of("2017-12-01T10:12:55.038194").to(Types.TimestampType.withoutZone()); -Object tsHour = hour.bind(Types.TimestampType.withoutZone()).apply(ts.value()); + Literal.of("2017-12-01T10:12:55.038194").to(Types.TimestampType.microsWithoutZone()); +Object tsHour = hour.bind(Types.TimestampType.microsWithoutZone()).apply(ts.value()); Review Comment: Will do. ## api/src/test/java/org/apache/iceberg/expressions/TestStringLiteralConversions.java: ## @@ -101,7 +101,7 @@ public void testStringToTimestampLiteral() { // Timestamp with explicit UTC offset, +00:00 Literal timestampStr = Literal.of("2017-08-18T14:21:01.919+00:00"); -Literal timestamp = timestampStr.to(Types.TimestampType.withZone()); +Literal timestamp = timestampStr.to(Types.TimestampType.microsWithZone()); Review Comment: Will do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.or
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1469961074 ## api/src/test/java/org/apache/iceberg/types/TestConversions.java: ## @@ -93,15 +93,26 @@ public void testByteBufferConversions() { assertThat(Literal.of(1L).to(TimeType.get()).toByteBuffer().array()) .isEqualTo(new byte[] {16, 39, 0, 0, 0, 0, 0, 0}); -// timestamps are stored as microseconds from 1970-01-01 00:00:00.00 in an 8-byte +// timestamps are stored as micro|nanoseconds from 1970-01-01 00:00:00 in an +// 8-byte Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add View Support to Trino [iceberg]
maxpoulain commented on issue #7939: URL: https://github.com/apache/iceberg/issues/7939#issuecomment-1915233640 Hi! Is there any news on this subject? Has this issue been moved to the Trino project? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.4: Fix writing of default values in CoW for rows with NULL columns which are unmatched [iceberg]
amogh-jahagirdar commented on PR #9556: URL: https://github.com/apache/iceberg/pull/9556#issuecomment-1915239004 > Just to be sure, can you set spark.sql.planChangeLog.level to info and execute the failing test to see what rule adds the default value behavior? It is still not clear how a wrong nullability info in MergeRows led to inserting default values. Sure, I've set that for before/after the fix (specifically the fix @rdblue mentioned here https://github.com/apache/iceberg/pull/9556#discussion_r1468932165). Logs Before: https://gist.github.com/amogh-jahagirdar/98ffeaa21e203423c7da0c98e132e9bd Logs After: https://gist.github.com/amogh-jahagirdar/6209e373b37e2d7a134df447ec49f8b5 I don't see anything differentiating in terms of the rules and their outputs. I can still explain why there's default value behavior, though, based on what I saw in the debugger. The Iceberg schema used for performing the write gets built here in SparkWriteBuilder#validateOrMergeWriteSchema: https://github.com/apache/iceberg/blob/54756b6f5c653be2ab271bfbe262e77109bf9608/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java#L120 That schema marks the fields *required* instead of optional, because the nullability passed through the Spark Writer API is incorrect. Then, when the write is actually performed, we end up writing default values. I've updated this PR to follow the approach @rdblue suggested in https://github.com/apache/iceberg/pull/9556#discussion_r1468932165, which after going through the thread I believe does make sense, and is a better solution compared to the previous approach, which just forces the matched output list to have the attributes so that the nullability check behaves as expected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.4: Fix writing of default values in CoW for rows with NULL columns which are unmatched [iceberg]
amogh-jahagirdar commented on code in PR #9556: URL: https://github.com/apache/iceberg/pull/9556#discussion_r1469971672 ## spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala: ## @@ -214,6 +214,8 @@ object RewriteMergeIntoTable extends RewriteRowLevelIcebergCommand with Predicat val rowFromSourceAttr = resolveAttrRef(ROW_FROM_SOURCE_REF, joinPlan) val rowFromTargetAttr = resolveAttrRef(ROW_FROM_TARGET_REF, joinPlan) +// The output expression should retain read attributes for correctly determining nullability +val matchedOutputsWithAttrs = matchedActions.map(matchedActionOutput(_, metadataAttrs) :+ readAttrs) Review Comment: Thanks @rdblue, I appreciate the detailed thread with the reasoning; I followed along and it makes sense. I agree that adding the readAttrs to the notMatchedOutput makes more sense for having a `MergeRows` logical plan that is logically sound, compared to the previous approach, which more or less forces the nullability check to pass just by passing the attributes along with the matched output list. I've updated the PR. @aokolnychyi let us know what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add View Support to Trino [iceberg]
amogh-jahagirdar closed issue #7939: Add View Support to Trino URL: https://github.com/apache/iceberg/issues/7939 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add View Support to Trino [iceberg]
amogh-jahagirdar commented on issue #7939: URL: https://github.com/apache/iceberg/issues/7939#issuecomment-1915260354 Hey @maxpoulain, yes, the tracking issue is in Trino: https://github.com/trinodb/trino/issues/14120#issuecomment-1803568597 If you're interested, I also have a PR open there for view support in Trino for the REST Catalog: https://github.com/trinodb/trino/pull/19818 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Add community page [iceberg-python]
syun64 commented on code in PR #315: URL: https://github.com/apache/iceberg-python/pull/315#discussion_r1470043007 ## mkdocs/docs/community.md: ## @@ -0,0 +1,64 @@ +--- +hide: + - navigation +--- + + + +# Join the community + +Apache Iceberg tracks issues in GitHub and prefers to receive contributions as pull requests. + +Community discussions happen primarily on the [dev mailing list](https://lists.apache.org/list.html?d...@iceberg.apache.org), on [Apache Iceberg Slack workspace](https://join.slack.com/t/apache-iceberg/shared_invite/zt-287g3akar-K9Oe_En5j1UL7Y_Ikpai3A) in the #python channel, and on specific [GitHub issues](https://github.com/apache/iceberg-python/issues). + +## Iceberg Community Events + +The PyIceberg community sync is on each last Tuesday each month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync). Review Comment: super nit (please feel free to take it or leave it): I'm wondering if this would read better ```suggestion The PyIceberg community sync is on the last Tuesday of every month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync). ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Add community page [iceberg-python]
Fokko commented on code in PR #315: URL: https://github.com/apache/iceberg-python/pull/315#discussion_r1470046433 ## mkdocs/docs/community.md: ## @@ -0,0 +1,64 @@ +--- +hide: + - navigation +--- + + + +# Join the community + +Apache Iceberg tracks issues in GitHub and prefers to receive contributions as pull requests. + +Community discussions happen primarily on the [dev mailing list](https://lists.apache.org/list.html?d...@iceberg.apache.org), on [Apache Iceberg Slack workspace](https://join.slack.com/t/apache-iceberg/shared_invite/zt-287g3akar-K9Oe_En5j1UL7Y_Ikpai3A) in the #python channel, and on specific [GitHub issues](https://github.com/apache/iceberg-python/issues). + +## Iceberg Community Events + +The PyIceberg community sync is on each last Tuesday each month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync). Review Comment: Thanks, good suggestion :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Small getting started guide on writes [iceberg-python]
HonahX commented on code in PR #311: URL: https://github.com/apache/iceberg-python/pull/311#discussion_r1470060936 ## mkdocs/docs/index.md: ## @@ -38,36 +38,129 @@ You can install the latest release version from pypi: pip install "pyiceberg[s3fs,hive]" ``` -Install it directly for GitHub (not recommended), but sometimes handy: +You can mix and match optional dependencies depending on your needs: + +| Key | Description: | +| | | +| hive | Support for the Hive metastore | +| glue | Support for AWS Glue | +| dynamodb | Support for AWS DynamoDB | +| sql-postgres | Support for SQL Catalog backed by Postgresql | +| sql-sqlite | Support for SQL Catalog backed by SQLite | +| pyarrow | PyArrow as a FileIO implementation to interact with the object store | +| pandas | Installs both PyArrow and Pandas | +| duckdb | Installs both PyArrow and DuckDB | +| ray | Installs PyArrow, Pandas, and Ray | +| s3fs | S3FS as a FileIO implementation to interact with the object store| +| adlfs| ADLFS as a FileIO implementation to interact with the object store | +| snappy | Support for snappy Avro compression | +| gcs | GCS as the FileIO implementation to interact with the object store | + +You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store. + +## Connecting to a catalog + +Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Icebergs' own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Checkout the [configuration](configuration.md) page to find all the configuration details. +## Write a PyArrow dataframe + +Let's take the Taxi dataset, and write this to an Iceberg table. + +First download one month of data: + +```shell +curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet ``` -pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"; + +Load it into your PyArrow dataframe: + +```python +import pyarrow.parquet as pq + +df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet") ``` -Or clone the repository for local development: +Create a new Iceberg table: -```sh -git clone https://github.com/apache/iceberg-python.git -cd iceberg-python -pip3 install -e ".[s3fs,hive]" +```python +from pyiceberg.catalog import load_catalog + +catalog = load_catalog("default") + +table = catalog.create_table( +"default.taxi_dataset", +schema=df.schema, # Blocked by https://github.com/apache/iceberg-python/pull/305 +) ``` -You can mix and match optional dependencies depending on your needs: +Append the dataframe to the table: + +```python +table.append(df) +len(table.scan().to_arrow()) +``` + +3066766 rows have been written to the table. + +Now generate a tip-per-mile feature to train the model on: + +```python +import pyarrow.compute as pc + +df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"])) +``` + +Evolve the schema of the table with the new column: + +```python +from pyiceberg.catalog import Catalog + +with table.update_schema() as update_schema: +# Blocked by https://github.com/apache/iceberg-python/pull/305 +update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema)) Review Comment: Thanks for the context! Yes, I think it's a good idea. 
Let's do it after #305 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix writing to local filesystem [iceberg-python]
Fokko merged PR #301: URL: https://github.com/apache/iceberg-python/pull/301 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Remove feature support page [iceberg-python]
Fokko merged PR #316: URL: https://github.com/apache/iceberg-python/pull/316 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Remove feature support page [iceberg-python]
Fokko commented on PR #316: URL: https://github.com/apache/iceberg-python/pull/316#issuecomment-1915379284 Thanks everyone for the quick review 🙌 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
Fokko commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470106203 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: > CREATE TABLE, which is the only operation we've supported so far doesn't need to compare any IDs against a baseSchema. In the java code, idFor function uses the existing ID from the baseSchema by searching it by name if it exists. If it doesn't exist, it assigns the new ID from the counter. Exactly, we could just pass in an empty schema (to avoid all the null-checks). > Instead, it returns the schema object and searches the new identifier field IDs by name. Is there a reason we decided to step away from this approach in the Python implementation? I don't think there was an explicit choice made there. The lookup by column name is correct, we do the same when evolving the schema: https://github.com/apache/iceberg-python/blob/a3e36838b05e4ac434ab394af758054b12d6/pyiceberg/table/__init__.py#L1404 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
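To make the name-based reuse idea above concrete, here is a minimal, self-contained sketch (hypothetical helper name `assign_fresh_ids`, not pyiceberg's actual visitor API): existing IDs are looked up by column name in a base mapping, and anything not found gets the next value from a counter. For CREATE TABLE the base mapping is simply empty, so every field gets a fresh ID.

```python
import itertools
from typing import Dict, List, Tuple


def assign_fresh_ids(
    fields: List[Tuple[str, int]],  # (column name, old field ID) pairs from the incoming schema
    base_ids: Dict[str, int],       # name -> field ID in the base schema; {} for CREATE TABLE
    next_id: int = 1,
) -> Dict[int, int]:
    """Map old field IDs to new ones, reusing base-schema IDs by name when available."""
    counter = itertools.count(next_id)
    old_to_new: Dict[int, int] = {}
    for name, old_id in fields:
        if name in base_ids:
            # The column already exists in the base schema: keep its ID stable.
            old_to_new[old_id] = base_ids[name]
        else:
            # New column: assign the next monotonically increasing ID.
            old_to_new[old_id] = next(counter)
    return old_to_new


# CREATE TABLE: no base schema, so IDs are simply 1, 2, ...
print(assign_fresh_ids([("id", 10), ("name", 11)], base_ids={}))  # {10: 1, 11: 2}

# Evolution-style reuse: "id" keeps its existing ID, "ts" gets a fresh one.
print(assign_fresh_ids([("id", 10), ("ts", 12)], base_ids={"id": 1, "name": 2}, next_id=3))  # {10: 1, 12: 3}
```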
Re: [PR] Spark 3.4: Support creating views via SQL [iceberg]
amogh-jahagirdar merged PR #9580: URL: https://github.com/apache/iceberg/pull/9580 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470130736 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: Ah interesting! Does this mean we could remove this from _SetFreshIDs, and instead handle the identifier_field_names update in the corresponding operation that requires updating identifier_field_names to the new IDs? That would simplify this visitor even more 😄 Right now, only create_table uses this visitor, and it doesn't require identifier_field_names to be updated (it'll always be an empty list for a new table). If [UpdateSchema](https://github.com/apache/iceberg-python/blob/a3e36838b05e4ac434ab394af758054b12d6/pyiceberg/table/__init__.py#L1437) is already doing this within the corresponding class that wraps the transaction, I think we could do the same in other operations that need it (like ReplaceTable) instead of keeping this feature within _SetFreshIDs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
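For illustration, a small sketch of what "handling it in the operation" could look like (hypothetical helper, not the actual pyiceberg code): once fresh IDs have been assigned, the operation that cares about identifier fields re-resolves them by name against the new schema, so the visitor itself no longer needs to track them.

```python
from typing import Dict, List


def resolve_identifier_field_ids(
    identifier_field_names: List[str],
    name_to_new_id: Dict[str, int],
) -> List[int]:
    """Re-resolve identifier fields by name against freshly assigned field IDs."""
    missing = [name for name in identifier_field_names if name not in name_to_new_id]
    if missing:
        raise ValueError(f"Cannot find identifier field(s) in the new schema: {missing}")
    return [name_to_new_id[name] for name in identifier_field_names]


# create_table: a brand-new table has no identifier fields yet, so this is just an empty list.
assert resolve_identifier_field_ids([], {"id": 1, "name": 2}) == []

# A replace-table style operation can resolve its identifier fields explicitly after ID assignment.
assert resolve_identifier_field_ids(["id"], {"id": 1, "name": 2}) == [1]
```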
Re: [PR] InMemory Catalog Implementation [iceberg-python]
Fokko commented on code in PR #289: URL: https://github.com/apache/iceberg-python/pull/289#discussion_r1469989182 ## pyiceberg/catalog/__init__.py: ## @@ -137,12 +138,19 @@ def load_sql(name: str, conf: Properties) -> Catalog: raise NotInstalledError("SQLAlchemy support not installed: pip install 'pyiceberg[sql-postgres]'") from exc +def load_memory(name: str, conf: Properties) -> Catalog: +from pyiceberg.catalog.in_memory import InMemoryCatalog + +return InMemoryCatalog(name, **conf) + + AVAILABLE_CATALOGS: dict[CatalogType, Callable[[str, Properties], Catalog]] = { CatalogType.REST: load_rest, CatalogType.HIVE: load_hive, CatalogType.GLUE: load_glue, CatalogType.DYNAMODB: load_dynamodb, CatalogType.SQL: load_sql, +CatalogType.MEMORY: load_memory, Review Comment: Can you also add this one to the docs: https://py.iceberg.apache.org/configuration/ With a warning that this is just for testing purposes only. ## pyiceberg/catalog/in_memory.py: ## @@ -0,0 +1,222 @@ +import uuid +from typing import ( +Dict, +List, +Optional, +Set, +Union, +) + +from pyiceberg.catalog import ( +Catalog, +Identifier, +Properties, +PropertiesUpdateSummary, +) +from pyiceberg.exceptions import ( +NamespaceAlreadyExistsError, +NamespaceNotEmptyError, +NoSuchNamespaceError, +NoSuchTableError, +TableAlreadyExistsError, +) +from pyiceberg.io import WAREHOUSE +from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec +from pyiceberg.schema import Schema +from pyiceberg.table import ( +CommitTableRequest, +CommitTableResponse, +Table, +update_table_metadata, +) +from pyiceberg.table.metadata import new_table_metadata +from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder +from pyiceberg.typedef import EMPTY_DICT + +DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse" + + +class InMemoryCatalog(Catalog): +"""An in-memory catalog implementation.""" + +__tables: Dict[Identifier, Table] +__namespaces: Dict[Identifier, Properties] + +def __init__(self, name: str, **properties: str) -> None: +super().__init__(name, **properties) +self.__tables = {} +self.__namespaces = {} +self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION + +def create_table( +self, +identifier: Union[str, Identifier], +schema: Schema, +location: Optional[str] = None, +partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, +sort_order: SortOrder = UNSORTED_SORT_ORDER, +properties: Properties = EMPTY_DICT, +table_uuid: Optional[uuid.UUID] = None, +) -> Table: +identifier = Catalog.identifier_to_tuple(identifier) +namespace = Catalog.namespace_from(identifier) + +if identifier in self.__tables: +raise TableAlreadyExistsError(f"Table already exists: {identifier}") +else: +if namespace not in self.__namespaces: +self.__namespaces[namespace] = {} + +if not location: +location = f'{self._warehouse_location}/{"/".join(identifier)}' + +metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json' Review Comment: It looks like we don't write the metadata here, but we write it below at the `_commit` method ## pyiceberg/catalog/in_memory.py: ## @@ -0,0 +1,222 @@ +import uuid +from typing import ( +Dict, +List, +Optional, +Set, +Union, +) + +from pyiceberg.catalog import ( +Catalog, +Identifier, +Properties, +PropertiesUpdateSummary, +) +from pyiceberg.exceptions import ( +NamespaceAlreadyExistsError, +NamespaceNotEmptyError, +NoSuchNamespaceError, +NoSuchTableError, +TableAlreadyExistsError, +) +from pyiceberg.io import WAREHOUSE +from pyiceberg.partitioning 
import UNPARTITIONED_PARTITION_SPEC, PartitionSpec +from pyiceberg.schema import Schema +from pyiceberg.table import ( +CommitTableRequest, +CommitTableResponse, +Table, +update_table_metadata, +) +from pyiceberg.table.metadata import new_table_metadata +from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder +from pyiceberg.typedef import EMPTY_DICT + +DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse" + + +class InMemoryCatalog(Catalog): +"""An in-memory catalog implementation.""" + +__tables: Dict[Identifier, Table] +__namespaces: Dict[Identifier, Properties] + +def __init__(self, name: str, **properties: str) -> None: +super().__init__(name, **properties) +self.__tables = {} +self.__namespaces = {} +self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION + +def create_table( +self, +identifier: Union[str, Identifier
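As a rough sketch of the review point about the metadata file above (using plain json and os calls instead of pyiceberg's FileIO, and a hypothetical metadata_dict as a stand-in for a real TableMetadata object), create_table could eagerly write the first metadata file so that metadata_location points at something that actually exists before the first _commit:

```python
import json
import os
from typing import Dict, Tuple


def write_initial_metadata(
    warehouse_location: str,
    identifier: Tuple[str, ...],
    metadata_dict: Dict,
) -> str:
    """Write the first metadata JSON for a newly created table and return its location."""
    table_dir = os.path.join(warehouse_location, *identifier)
    metadata_location = os.path.join(table_dir, "metadata", "metadata.json")
    os.makedirs(os.path.dirname(metadata_location), exist_ok=True)
    with open(metadata_location, "w", encoding="utf-8") as out:
        json.dump(metadata_dict, out)
    return metadata_location


# Stand-in payload; a real catalog would serialize TableMetadata through its FileIO instead.
print(write_initial_metadata("/tmp/warehouse", ("default", "taxi_dataset"), {"format-version": 2}))
```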
[PR] Build: Bump pydantic from 2.5.3 to 2.6.0 [iceberg-python]
dependabot[bot] opened a new pull request, #317: URL: https://github.com/apache/iceberg-python/pull/317

Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.5.3 to 2.6.0.

Release notes, sourced from pydantic's releases (v2.6.0, 2024-01-29; the code released in v2.6.0 is practically identical to that of v2.6.0b1):

- Packaging: check for email-validator version >= 2.0 (#6033); upgrade ruff target version to Python 3.8 (#8341); update to pydantic-extra-types==2.4.1 (#8478); update to pyright==1.1.345 (#8453); update pydantic-core from 2.14.6 to 2.16.1 (significant changes from these updates are described below).
- New features: add NatsDsn (#6874); add ConfigDict.ser_json_inf_nan (#8159); add types.OnErrorOmit (#8222); support AliasGenerator usage (#8282); add Pydantic People page to the docs (#8345); support YYYY-MM-DD datetime parsing (#8404); add bit conversions to the ByteSize class (#8415, #8507); enable JSON schema creation with type ByteSize (#8537); add eval_type_backport to handle the union operator and builtin generic subscripting in older Pythons (#8209); add support for dataclass fields init (#8552); implement pickling for ValidationError (pydantic/pydantic-core#1119); add a unified tuple validator that can handle "variadic" tuples via PEP-646 (pydantic/pydantic-core#865).
- Changes: drop Python 3.7 support (#7188); drop Python 3.7 and PyPy 3.7/3.8 (pydantic/pydantic-core#1129); use positional-only self in the BaseModel constructor, so no field name can ever conflict with it (#8072); make @validate_call return a function instead of a custom descriptor, fixing a binding issue with inheritance and adding the self/cls argument to validation errors (#8268); exclude the BaseModel docstring from the JSON schema description (#8352); introduce a classproperty decorator for model_computed_fields (#8437); explicitly raise an error if field names clash with types (#8243); use a stricter serializer for unions of simple types (pydantic/pydantic-core#1132).
- Performance: ... (truncated)
[PR] Build: Bump mkdocs-material from 9.5.5 to 9.5.6 [iceberg-python]
dependabot[bot] opened a new pull request, #318: URL: https://github.com/apache/iceberg-python/pull/318

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.5 to 9.5.6.

Release notes, sourced from mkdocs-material's releases: mkdocs-material-9.5.6 fixed #6700, missing styles for Mermaid.js labels with Markdown.

Changelog, sourced from mkdocs-material's CHANGELOG:

- 9.5.6+insiders-4.52.0 (2024-01-28): added support for instant previews; fixed footnote tooltip positioning edge cases; fixed #6703, new tags plugin crashes on Windows.
- 9.5.6 (2024-01-28): fixed #6700, missing styles for Mermaid.js labels with Markdown.
- 9.5.5+insiders-4.51.0 (2024-01-24): added support for footnote tooltips.
- 9.5.5 (2024-01-24): updated Tagalog translations; updated Pillow to 10.2 to mitigate security vulnerabilities; improved resilience of instant navigation; fixed #6687 (updated Mermaid.js to version 10.7.0, latest), #6652 (keyboard events in custom elements captured), and #6582, #6565, #6345, #6334, #6275, #6102 (instant-navigation issues: alternate URLs, onclick handlers, browser back button, anchor positioning in Safari, resolution after 404, reload on same-link navigation).
- 9.5.4+insiders-4.50.0 (2024-01-19): added configurable logging capabilities to the privacy plugin.
- 9.5.4 (2024-01-15): fixed #6645 (local storage with invalid value can break site) and #6635 (tags icons before default ignored if default is set).
- 9.5.3+insiders-4.49.2 (2024-01-09): fixed missing attribute lists extension for the tags plugin; fixed #6627, new tags plugin crashes on Python 3.8.
- 9.5.3+insiders-4.49.1 (2024-01-07): improved interop of the new tags plugin with other plugins; fixed #6594 (tags plugin doesn't work with the mkdocs-macros plugin) and #6569 (social plugin crashes if in a different file system location).
- 9.5.3+insiders-4.49.0 (2023-12-29): ... (truncated)

Commits: 6026738 Prepare 9.5.6 release; a65fd1e Updated distribution files; 335dd3a Fixed Markdown formatted nodes in Mermaid.js; f55ce2c Updated Insiders changelog; 6724141 Documentation. See full diff in the compare view: https://github.com/squidfunk/mkdocs-material/compare/9.5.5...9.5.6

Dependabot compatibility score: https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. ... (truncated)
[PR] Build: Bump pytest from 7.4.4 to 8.0.0 [iceberg-python]
dependabot[bot] opened a new pull request, #319: URL: https://github.com/apache/iceberg-python/pull/319

Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.4 to 8.0.0.

Release notes, sourced from pytest's releases:

- pytest 8.0.0 (2024-01-27): see 8.0.0rc1 and 8.0.0rc2 for the full changes since pytest 7.4. Bug fixes: #11842, properly escape the reason of a skip mark when writing JUnit XML files; #11861, avoid "microsecond exceeds 1_000_000" when using log-date-format with the %f specifier, which might cause the test suite to crash.
- pytest 8.0.0rc2 (2024-01-17). Improvements: #11233, improvements to -r for xfailures and xpasses (report tracebacks for xfailures when -rx is set, report captured output for xpasses when -rX is set, and for xpasses add "-" in the summary between test name and reason to match how xfail is displayed); #11825, the pytest_plugin_registered hook has a new plugin_name parameter containing the name by which the plugin is registered. Bug fixes: #11706, fix reporting of teardown errors in higher-scoped fixtures when using --maxfail or --stepwise; #11758, fix an "IndexError: string index out of range" crash in `if highlighted[-1] == "\n" and source[-1] != "\n"` introduced in 8.0.0rc1; #9765 and #11816, fix a frustrating bug whose only error was "assert mod not in mods", caused by str(Path(mod)) and mod.__file__ not necessarily producing the same string while being used interchangeably in some places in the code (this fix also broke the internal API of PytestPluginManager.consider_conftest by introducing a new parameter, mentioned in case external code uses it even though it is marked private).
- pytest 8.0.0rc1 (2023-12-30): rendered changelog at https://docs.pytest.org/en/latest/changelog.html#pytest-8-0-0rc1-2023-12-30. Breaking changes, old deprecations are now errors: #7363, PytestRemovedIn8Warning deprecation warnings are now errors by default; following the plan to remove deprecated features with as little disruption as possible, the affected features will be effectively removed in pytest 8.1, so please consult the deprecations section in the docs for directions on how to update existing code. In the pytest 8.0.X series, it is possible to change the errors back into warnings as a stopgap measure by adding this to your pytest.ini file: [pytest] ... (truncated)

Commits: 478f823 Prepare release version 8.0.0; 6085900 [8.0.x] fix: avoid rounding microsecond to 1_000_000 (#11863); 3b41c65 [8.0.x] Escape skip reason in junitxml (#11845); 747072a [8.0.x] Update docstring of scripts/generate-gh-release-notes.py (#11768); 011a475 Properly attach packages to the GH release notes (#11839, #11840); 97960bd Merge pull request #11835 from pytest-d... (truncated)
[PR] Build: Bump adlfs from 2023.12.0 to 2024.1.0 [iceberg-python]
dependabot[bot] opened a new pull request, #320: URL: https://github.com/apache/iceberg-python/pull/320

Bumps [adlfs](https://github.com/fsspec/adlfs) from 2023.12.0 to 2024.1.0.

Release notes, sourced from adlfs's releases (2024.1.0):

- adlfs: fix version typo by @efiop (fsspec/adlfs#449)
- Check for Hdi_isfolder with a capital by @basnijholt (fsspec/adlfs#418)
- put_file: default to overwrite=True by @pmrowla (fsspec/adlfs#419)
- Fix recursive delete on hierarchical namespace accounts by @Tom-Newton (fsspec/adlfs#454)

New contributors: @Tom-Newton made their first contribution in fsspec/adlfs#454.

Full changelog: https://github.com/fsspec/adlfs/compare/2023.12.0...2024.1.0

Commits: see full diff in the compare view: https://github.com/fsspec/adlfs/compare/2023.12.0...2024.1.0

Dependabot compatibility score: https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options. You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump moto from 4.2.13 to 5.0.0 [iceberg-python]
dependabot[bot] opened a new pull request, #321: URL: https://github.com/apache/iceberg-python/pull/321

Bumps [moto](https://github.com/getmoto/moto) from 4.2.13 to 5.0.0.

Changelog, sourced from moto's CHANGELOG.md:

5.0.0 (Docker digest for 5.0.0: sha256:2faf2460a6446dfe472ac0d799e00341b1c84203d08540c247a6cc09be7c54e9). General:
* All decorators have been replaced with a single decorator: `mock_aws`
* The `mock_batch_simple` and `mock_lambda_simple` decorators can now be configured using the `config` parameter: `@mock_aws(config={"batch": {"use_docker": False}, "lambda": {"use_docker": False}})`
* It is now possible to configure methods/services which should reach out to AWS: `@mock_aws(config={"core": {"mock_credentials": False, "passthrough": {"urls": [], "services": []}}})`
* All requests now return a RequestId

Miscellaneous:
* IAM: the AWS managed policies are no longer loaded by default. If your application depends on these policies, tell Moto explicitly to load them with `@mock_aws(config={"iam": {"load_aws_managed_policies": True}})` or by setting an environment variable: MOTO_IAM_LOAD_MANAGED_POLICIES=true
* S3: list_objects() now returns a hashed ContinuationToken

4.2.14 (Docker digest for 4.2.14: sha256:2fa10aa48e32f85c63c62a7d437b8a4b320a56a8494bc25d45ced370bc159c23). New services: Backup (create_backup_plan(), create_backup_vault(), get_backup_plan(), describe_backup_vault(), delete_backup_plan(), list_backup_plans(), list_backup_vaults(), list_tags(), tag_resource(), untag_resource()). New methods: RDS: ... (truncated)

Commits: 0255717 Prep Release V5; 38f6359 Refactoring: adding typing for event message structure (#7234); 38ee3fa Prep release 5.0.0alpha3; 09c5751 MotoServer: skip using HTTP_HOST header when determining request origin (#7225); 8199a88 Techdebt: update scaffolding to use mock_aws (#7221); dea4a98 Techdebt: remove old/deprecated code (#7220); 16ba3de Techdebt: verify minimum boto3 version, up minimum responses version (#7218); f834d98 IAM: only load AWS managed policies on request (#7219); 0c8ccbb Techdebt: hide private decorator methods/attributes (#7216); 5aa3cc9 S3: list_objects_v2() should have a hashed NextContinuationToken (#7187). Additional commits viewable in the compare view: https://github.com/getmoto/moto/compare/4.2.13...5.0.0

Dependabot compatibility score: https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. Dependabot commands and options. You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR ... (truncated)
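Since the headline change in moto 5.0 is collapsing the per-service decorators into the single `mock_aws`, a test that previously used a service-specific decorator would migrate roughly as sketched below (a minimal example assuming a simple S3 round-trip; the commented-out import shows the moto 4.x form):

```python
import boto3
from moto import mock_aws  # moto 5.x: replaces mock_s3, mock_glue, mock_dynamodb, ...

# moto 4.x equivalent:
#   from moto import mock_s3
#   @mock_s3


@mock_aws
def test_create_bucket() -> None:
    # All AWS calls inside the decorated function hit moto's in-memory backends.
    client = boto3.client("s3", region_name="us-east-1")
    client.create_bucket(Bucket="warehouse")
    names = [bucket["Name"] for bucket in client.list_buckets()["Buckets"]]
    assert names == ["warehouse"]


test_create_bucket()
```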
Re: [PR] Flink: Adds the ability to read from a branch on the Flink Iceberg Source [iceberg]
stevenzwu commented on code in PR #9547: URL: https://github.com/apache/iceberg/pull/9547#discussion_r1470308941 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java: ## @@ -97,7 +98,38 @@ public void clean() { super.clean(); } + // private void insertRows(String partition, String branch, Table table, Row... rows) Review Comment: this should be removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Use execution ID in executor cache [iceberg]
aokolnychyi commented on PR #9583: URL: https://github.com/apache/iceberg/pull/9583#issuecomment-1915715186 I am going to run this a number of times to make sure the new logic is not flaky and then will test it one more time on a cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [DO NOT MERGE] New docs switch [iceberg]
rdblue commented on PR #9520: URL: https://github.com/apache/iceberg/pull/9520#issuecomment-1915715262 @bitsondatadev, thanks! This looks more like what I was expecting! I'm +1 on this when the thread on the dev list concludes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump pypa/cibuildwheel from 2.16.3 to 2.16.4 [iceberg-python]
dependabot[bot] opened a new pull request, #322: URL: https://github.com/apache/iceberg-python/pull/322

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.16.3 to 2.16.4.

Release notes sourced from pypa/cibuildwheel's releases (https://github.com/pypa/cibuildwheel/releases):

v2.16.4
🛠 Update manylinux pins to upgrade from a problematic PyPy version. (#1737)

Changelog sourced from pypa/cibuildwheel's changelog (https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md):

v2.16.4 (28 January 2024)
🛠 Update manylinux pins to upgrade from a problematic PyPy version. (#1737)

Commits:
* 0b04ab1 Bump version: v2.16.4 (https://github.com/pypa/cibuildwheel/commit/0b04ab1040366101259658b355777e4ff2d16f83)
* 34b049f [Bot] Update dependencies (#1737) (https://github.com/pypa/cibuildwheel/commit/34b049f1389fd9cfef5de1d1781ec67759770eea)
* 4a79413 Remove the Cirrus CI badge from readme (https://github.com/pypa/cibuildwheel/commit/4a79413fe6d6fd88b4317bcf0e80252292fd34f5)

See full diff in the compare view: https://github.com/pypa/cibuildwheel/compare/v2.16.3...v2.16.4

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump aiohttp from 3.9.1 to 3.9.2 [iceberg-python]
dependabot[bot] opened a new pull request, #323: URL: https://github.com/apache/iceberg-python/pull/323

Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.9.1 to 3.9.2.

Release notes sourced from aiohttp's releases (https://github.com/aio-libs/aiohttp/releases):

3.9.2
Bug fixes
* Fixed server-side websocket connection leak. Related issues and pull requests on GitHub: #7978.
* Fixed web.FileResponse doing blocking I/O in the event loop. Related issues and pull requests on GitHub: #8012.
* Fixed double compress when compression enabled and compressed file exists in server file responses. Related issues and pull requests on GitHub: #8014.
* Added runtime type check for ClientSession timeout parameter. Related issues and pull requests on GitHub: #8021.
* Fixed an unhandled exception in the Python HTTP parser on header lines starting with a colon -- by :user:pajod. Invalid request lines with anything but a dot between the HTTP major and minor version are now rejected. Invalid header field names containing question mark or slash are now rejected. Such requests are incompatible with :rfc:9110#section-5.6.2 and are not known to be of any legitimate use. Related issues and pull requests on GitHub: #8074.
* Improved validation of paths for static resources requests to the server -- by :user:bdraco.
... (truncated)

Changelog sourced from aiohttp's changelog (https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst):

3.9.2 (2024-01-28)
Bug fixes
* Fixed server-side websocket connection leak. Related issues and pull requests on GitHub: :issue:7978.
* Fixed web.FileResponse doing blocking I/O in the event loop. Related issues and pull requests on GitHub: :issue:8012.
* Fixed double compress when compression enabled and compressed file exists in server file responses. Related issues and pull requests on GitHub: :issue:8014.
* Added runtime type check for ClientSession timeout parameter. Related issues and pull requests on GitHub: :issue:8021.
* Fixed an unhandled exception in the Python HTTP parser on header lines starting with a colon -- by :user:pajod. Invalid request lines with anything but a dot between the HTTP major and minor version are now rejected. Invalid header field names containing question mark or slash are now rejected. Such requests are incompatible with :rfc:9110#section-5.6.2 and are not known to be of any legitimate use. Related issues and pull requests on GitHub: :issue:8074.
... (truncated)

Commits:
* 24a6d64 Release v3.9.2 (#8082) (https://github.com/aio-libs/aiohttp/commit/24a6d64966d99182e95f5d3a29541ef2fec397ad)
* 9118a58 [PR #8079/1c335944 backport][3.9] Validate static paths (#8080) (https://github.com/aio-libs/aiohttp/commit/9118a5831e8a65b8c839eb7e4ac983e040ff41df)
* 435ad46 [PR #3955/8960063e backport][3.9] Replace all tmpdir fixtures with tmp_path (... (https://github.com/aio-libs/aiohttp/commit/435ad46e6c26cbf6ed9a38764e9ba8e7441a0e3b)
* d33bc21 Improve validation in HTTP parser (#8074) (#8078) (https://github.com/aio-libs/aiohttp/commit/d33bc21414e283c9e6fe7f6caf69e2ed60d66c82)
* 0d945d1 [PR #7916/822fbc74 backport][3.9] Add more information to contributing page (... (https://github.com/aio-libs/aiohttp/commit/0d945d1be08f2ba8475513216a66411f053c3217)
* 3ec4fa1 [PR #8069/69bbe874 backport][3.9] 📝 Only show changelog draft for non-release... (https://github.com/aio-libs/aiohttp/commit/3ec4fa1f0e0a0dad218c75dbe5ed09e22d5cc284)
* 419d715 [PR #8066/cba34699 backport][3.9] 💅📝 Restructure the changelog for clarity (#... (https://github.com/aio-libs/aiohttp/commit/419d715c42c46daf1a59e0aff61c1f6d10236982)
* a54dab3 [PR #8049/a379e634 backport][3.9] Set cause for ClientPayloadError (https://github.com/aio-libs/aiohttp/commit/a54dab3b36bcf0d815b9244f52ae7bc5da08f387) (https://redirect.github.com/aio-libs/a
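Of the changes above, the runtime type check on ClientSession's timeout parameter is the one most likely to surface in calling code. A minimal, hedged sketch of the form that remains valid after this bump; the URL is illustrative only.

```python
# Hedged sketch: aiohttp 3.9.2 adds a runtime type check on ClientSession's
# `timeout` argument, so passing an explicit ClientTimeout (rather than a bare
# number) is the safe form. The URL is illustrative only.
import asyncio

import aiohttp


async def fetch_status(url: str) -> int:
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            return resp.status


if __name__ == "__main__":
    print(asyncio.run(fetch_status("https://example.com")))
```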
Re: [PR] Flink: Adds the ability to read from a branch on the Flink Iceberg Source [iceberg]
stevenzwu merged PR #9547: URL: https://github.com/apache/iceberg/pull/9547 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470463965 ## tests/catalog/test_base.py: ## @@ -330,6 +334,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None: assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table +@pytest.mark.parametrize( +"schema,expected", +[ +(lazy_fixture("pyarrow_schema_simple_without_ids"), lazy_fixture("iceberg_schema_simple_no_ids")), +(lazy_fixture("iceberg_schema_simple"), lazy_fixture("iceberg_schema_simple")), +(lazy_fixture("iceberg_schema_nested"), lazy_fixture("iceberg_schema_nested")), +(lazy_fixture("pyarrow_schema_nested_without_ids"), lazy_fixture("iceberg_schema_nested_no_ids")), +], +) +def test_convert_schema_if_needed( +schema: Union[Schema, pa.Schema], +expected: Schema, +catalog: InMemoryCatalog, +) -> None: +assert expected == catalog._convert_schema_if_needed(schema) + + +def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_simple_without_ids: pa.Schema) -> None: +table = catalog.create_table( +identifier=TEST_TABLE_IDENTIFIER, +schema=pyarrow_schema_simple_without_ids, +location=TEST_TABLE_LOCATION, +partition_spec=TEST_TABLE_PARTITION_SPEC, +properties=TEST_TABLE_PROPERTIES, +) +assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table + Review Comment: I think it may be better to have another assertion here: ``` assert table.schema() == iceberg_schema_simple ``` maybe also add fixture parametrization to include the `pyarrow_schema_nested` This is not a big deal. So probably we can discuss/modify this in some follow-up PRs. Currently, `test_sql.py` has covered this case: https://github.com/apache/iceberg-python/pull/305/files#diff-fad2e9bc49f619f03bf0631fce12da1f524c786008dbdfc01a35f6caf19a7c01R161 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
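For readers following the review, here is a rough sketch of what the suggested change could look like once folded into the quoted test. The catalog and schema fixtures are the ones from the existing test module; the identifier value shown is illustrative, and this is a sketch of the suggestion rather than the final test.

```python
# Rough sketch of the suggestion above: keep the existing round-trip check and
# add an assertion on the schema the catalog actually stored. The `catalog`,
# `pyarrow_schema_simple_without_ids`, and `iceberg_schema_simple` fixtures come
# from the quoted test module; the identifier value is illustrative.
import pyarrow as pa

from pyiceberg.schema import Schema

TEST_TABLE_IDENTIFIER = ("com", "organization", "department", "my_table")  # illustrative


def test_create_table_pyarrow_schema(
    catalog, pyarrow_schema_simple_without_ids: pa.Schema, iceberg_schema_simple: Schema
) -> None:
    table = catalog.create_table(
        identifier=TEST_TABLE_IDENTIFIER,
        schema=pyarrow_schema_simple_without_ids,
    )
    assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table
    assert table.schema() == iceberg_schema_simple  # the extra assertion proposed above
```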
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470488890 ## tests/catalog/test_base.py: ## @@ -330,6 +334,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None: assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table +@pytest.mark.parametrize( +"schema,expected", +[ +(lazy_fixture("pyarrow_schema_simple_without_ids"), lazy_fixture("iceberg_schema_simple_no_ids")), +(lazy_fixture("iceberg_schema_simple"), lazy_fixture("iceberg_schema_simple")), +(lazy_fixture("iceberg_schema_nested"), lazy_fixture("iceberg_schema_nested")), +(lazy_fixture("pyarrow_schema_nested_without_ids"), lazy_fixture("iceberg_schema_nested_no_ids")), +], +) +def test_convert_schema_if_needed( +schema: Union[Schema, pa.Schema], +expected: Schema, +catalog: InMemoryCatalog, +) -> None: +assert expected == catalog._convert_schema_if_needed(schema) + + +def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_simple_without_ids: pa.Schema) -> None: +table = catalog.create_table( +identifier=TEST_TABLE_IDENTIFIER, +schema=pyarrow_schema_simple_without_ids, +location=TEST_TABLE_LOCATION, +partition_spec=TEST_TABLE_PARTITION_SPEC, +properties=TEST_TABLE_PROPERTIES, +) +assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table + Review Comment: Thank you for the great suggestion @HonahX . I just tried to update the test in test_base.py, but it requires us to use new_table_metadata instead of the existing mechanism of creating a new TableMetadataV1 instance, which creates a new uuid and breaks the cli/test_console.py tests. So I think it might be simpler to fix this once we get https://github.com/apache/iceberg-python/pull/289 merged in -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470492242 ## tests/catalog/test_base.py: ## @@ -330,6 +334,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None: assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table +@pytest.mark.parametrize( +"schema,expected", +[ +(lazy_fixture("pyarrow_schema_simple_without_ids"), lazy_fixture("iceberg_schema_simple_no_ids")), +(lazy_fixture("iceberg_schema_simple"), lazy_fixture("iceberg_schema_simple")), +(lazy_fixture("iceberg_schema_nested"), lazy_fixture("iceberg_schema_nested")), +(lazy_fixture("pyarrow_schema_nested_without_ids"), lazy_fixture("iceberg_schema_nested_no_ids")), +], +) +def test_convert_schema_if_needed( +schema: Union[Schema, pa.Schema], +expected: Schema, +catalog: InMemoryCatalog, +) -> None: +assert expected == catalog._convert_schema_if_needed(schema) + + +def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_simple_without_ids: pa.Schema) -> None: +table = catalog.create_table( +identifier=TEST_TABLE_IDENTIFIER, +schema=pyarrow_schema_simple_without_ids, +location=TEST_TABLE_LOCATION, +partition_spec=TEST_TABLE_PARTITION_SPEC, +properties=TEST_TABLE_PROPERTIES, +) +assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table + Review Comment: Thanks for trying this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470494285 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I tried this out, and I think we might have to leave this logic in for this iteration. The usage of create_table in PyIceberg is slightly different from the Java implementation in that the user can create an IcebergSchema, instead of a Spark DF schema or a PyArrow schema. This means that the user can choose to include identifier_field_ids in the IcebergSchema that they define. I think it might make sense to consolidate the way we track the identifier_field_ids on create_table, update_schema, replace_table and other similar operations in a future PR. I'll create an issue to track this task -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470494285 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I tried this out, and I think we might have to leave this logic in for this iteration. The usage of create_table in PyIceberg is slightly different from the Java implementation in that the user can create an IcebergSchema, instead of a Spark DF schema or a PyArrow schema. This means that the user can choose to include identifier_field_ids in the IcebergSchema that they define. I think it might make sense to consolidate the way we update the identifier_field_ids on create_table, update_schema, replace_table and other similar operations in a future PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on PR #305: URL: https://github.com/apache/iceberg-python/pull/305#issuecomment-1915934234 Looks good to merge from my side. I left my closing comments on all the open discussions. Thank you @Fokko @kevinjqliu and @HonahX for your reviews! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX commented on PR #305: URL: https://github.com/apache/iceberg-python/pull/305#issuecomment-1915934692 Thanks @syun64 for the great work! Thanks @Fokko and @kevinjqliu for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX merged PR #305: URL: https://github.com/apache/iceberg-python/pull/305 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470494285 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I tried this out, and I think we might have to leave this logic in for this iteration. In PyIceberg, a user can create a table using IcebergSchema, instead of a PyArrow schema. This means that the user can choose to include identifier_field_ids in the IcebergSchema that they define, and the current create_table or assign_fresh_ids function calls don’t have the logic to handle this outside of _SetFreshIDs I think it might make sense to consolidate the way we update the identifier_field_ids on create_table, update_schema, replace_table and other similar operations in a future PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
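For context on the case described in this thread, here is a minimal sketch of a user-built Iceberg Schema that already declares identifier fields being passed straight to create_table, which is why the fresh-ID assignment also has to remap identifier_field_ids. The catalog name, table name, and field layout are illustrative.

```python
# Minimal sketch of the case discussed above: create_table called with a
# user-built Iceberg Schema that already declares identifier fields, so fresh-ID
# assignment also has to remap identifier_field_ids. Catalog name, table name,
# and fields are illustrative.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

schema = Schema(
    NestedField(field_id=10, name="id", field_type=LongType(), required=True),
    NestedField(field_id=11, name="name", field_type=StringType(), required=False),
    identifier_field_ids=[10],  # points at the user-chosen id, not the reassigned one
)

catalog = load_catalog("default")
table = catalog.create_table("default.example_table", schema=schema)

# After fresh IDs are assigned, the identifier field should still resolve to "id".
print(table.schema().identifier_field_ids)
```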
Re: [PR] feat: add support for catalogs with glue implementation to start [iceberg-go]
zeroshade commented on PR #51: URL: https://github.com/apache/iceberg-go/pull/51#issuecomment-1915981210 @wolfeidau Can you add check marks to the appropriate spots in the README for the functionality you're adding? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
liurenjie1024 commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470518241 ## crates/iceberg/src/writer/file_writer/parquet_writer.rs: ## Review Comment: I'm thinking moving this to another crate `iceberg-parquet`, the core crate only contains `FileWriter` trait. WDYT? cc @ZENOTME @Xuanwo @Fokko ## crates/iceberg/src/spec/manifest.rs: ## @@ -924,7 +924,7 @@ impl TryFrom for ManifestStatus { } /// Data file carries data file path, partition tuple, metrics, … -#[derive(Debug, PartialEq, Clone, Eq, TypedBuilder)] +#[derive(Debug, PartialEq, Clone, Eq, Builder)] Review Comment: Why this change? `TypedBuilder` checks it in compile time, while `Builder` checks result in runtime. ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Review Comment: We can't make it concrete `Writer` here since we can't expose opendal api to user. You can store `Box` in parquet writer. ## crates/iceberg/src/io.rs: ## @@ -268,6 +268,16 @@ impl OutputFile { .await?) } +/// Delete file. +pub async fn delete(&self) -> Result<()> { Review Comment: This seems unsound to me. If it's deleted, we should consume it, so its argument should be `self` rather `&self` ## crates/iceberg/src/writer/file_writer/parquet_writer.rs: ## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! The module contains the file writer for parquet file format. + +use std::{ +collections::HashMap, +sync::{atomic::AtomicI64, Arc}, +}; + +use crate::{ +io::OutputFile, +spec::{DataFileBuilder, DataFileFormat}, +writer::CurrentFileStatus, +Error, +}; +use arrow_schema::SchemaRef; +use opendal::Writer; +use parquet::file::properties::WriterProperties; +use parquet::{arrow::AsyncArrowWriter, format::FileMetaData}; + +use super::{track_writer::TrackWriter, FileWriter, FileWriterBuilder}; + +/// ParquetWriterBuilder is used to builder a [`ParquetWriter`] +#[derive(Clone)] +pub struct ParquetWriterBuilder { +out_file: OutputFile, +/// `buffer_size` determines the initial size of the intermediate buffer. 
+/// The intermediate buffer will automatically be resized if necessary +init_buffer_size: usize, +props: WriterProperties, +schema: SchemaRef, +} + +impl ParquetWriterBuilder { +/// Create a new `ParquetWriterBuilder` +pub fn new( +out_file: OutputFile, +init_buffer_size: usize, +props: WriterProperties, +schema: SchemaRef, +) -> Self { +Self { +out_file, +init_buffer_size, +props, +schema, +} +} +} + +impl FileWriterBuilder for ParquetWriterBuilder { +type R = ParquetWriter; + +async fn build(self) -> crate::Result { +let written_size = Arc::new(AtomicI64::new(0)); +let inner_writer = TrackWriter::new(self.out_file.writer().await?, written_size.clone()); +let writer = AsyncArrowWriter::try_new( +inner_writer, +self.schema, +self.init_buffer_size, +Some(self.props), +) +.map_err(|err| { +Error::new( +crate::ErrorKind::Unexpected, +"build error from parquet writer", +) +.with_source(err) +})?; +Ok(ParquetWriter { +out_file: self.out_file, +writer, +written_size, +current_row_num: 0, +}) +} +} + +/// `ParquetWriter`` is used to write arrow data into parquet file on storage. +pub struct ParquetWriter { +out_file: OutputFile, +writer: AsyncArrowWriter>, +written_size: Arc, +current_row_num: usize, +} + +impl ParquetWriter { +fn to_data_file_builder( +metadata: FileMetaData, +written_size: usize, +file_path: String, +) -> DataFileBuilder { +let (column_sizes, value_counts, null_value_counts) = { +let m
Re: [PR] Core: HadoopTable needs to skip file cleanup after task failure under some boundary conditions. [iceberg]
BsoBird commented on PR #9546: URL: https://github.com/apache/iceberg/pull/9546#issuecomment-1915987434 @szehon-ho Hi. Can you check this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Rest Catalog: Add RESTful AppendFiles data operation [iceberg]
danielcweeks commented on code in PR #9292: URL: https://github.com/apache/iceberg/pull/9292#discussion_r1470559173 ## core/src/main/java/org/apache/iceberg/MetadataUpdate.java: ## @@ -490,4 +491,21 @@ public void applyTo(ViewMetadata.Builder viewMetadataBuilder) { viewMetadataBuilder.setCurrentVersionId(versionId); } } + + class AppendFilesUpdate implements MetadataUpdate { Review Comment: We might want to rename this to `AppendManifestsUpdate`, since "appending files" suggests `DataFile` objects rather than manifests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Rest Catalog: Add RESTful AppendFiles data operation [iceberg]
danielcweeks commented on code in PR #9292: URL: https://github.com/apache/iceberg/pull/9292#discussion_r1470562788 ## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java: ## @@ -112,6 +111,7 @@ public class RESTSessionCatalog extends BaseViewSessionCatalog private static final Logger LOG = LoggerFactory.getLogger(RESTSessionCatalog.class); private static final String DEFAULT_FILE_IO_IMPL = "org.apache.iceberg.io.ResolvingFileIO"; private static final String REST_METRICS_REPORTING_ENABLED = "rest-metrics-reporting-enabled"; + private static final String REST_DATA_COMMIT_ENABLED = "rest-data-commit-enabled"; Review Comment: A few places we probably need to update the naming. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add UnboundSortOrder [iceberg-rust]
fqaiser94 commented on code in PR #115: URL: https://github.com/apache/iceberg-rust/pull/115#discussion_r1470584482 ## crates/iceberg/src/spec/sort.rs: ## @@ -88,15 +91,106 @@ impl SortOrder { pub fn is_unsorted(&self) -> bool { self.fields.is_empty() } + +/// Converts to an unbound sort order +pub fn to_unbound(&self) -> UnboundSortOrder { +UnboundSortOrder::builder() +.with_order_id(self.order_id) +.with_fields( +self.fields +.iter() +.map(|x| UnboundSortField { +source_id: x.source_id, +transform: x.transform, +direction: x.direction, +null_order: x.null_order, +}) +.collect_vec(), +) +.build() +.unwrap() +} +} + +/// Reference to [`UnboundSortOrder`]. +pub type UnboundSortOrderRef = Arc; + +/// Entry for every column that is to be sorted +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)] +#[serde(rename_all = "kebab-case")] +pub struct UnboundSortField { +/// A source column id from the table’s schema +pub source_id: i32, +/// A transform that is used to produce values to be sorted on from the source column. +pub transform: Transform, +/// A sort direction, that can only be either asc or desc +pub direction: SortDirection, +/// A null order that describes the order of null values when sorted. +pub null_order: NullOrder, +} + +/// Unbound sort order can be later bound to a schema. +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Builder, Default)] +#[serde(rename_all = "kebab-case")] +#[builder(setter(prefix = "with"))] +#[builder(build_fn(skip))] +pub struct UnboundSortOrder { +/// Identifier for the SortOrder, order_id `0` is no sort order. +pub order_id: i64, +/// Details of the sort +#[builder(setter(each(name = "with_sort_field")))] +pub fields: Vec, +} Review Comment: I'm not opposed to the idea of having separate builder methods. Implemented a rough version of that in the latest [commit](https://github.com/apache/iceberg-rust/pull/115/commits/5a6790323e2d0b697019a8b933f599a220d3540a). Take a look and let me know what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add UnboundSortOrder [iceberg-rust]
fqaiser94 commented on PR #115: URL: https://github.com/apache/iceberg-rust/pull/115#issuecomment-1916069215 Found some time to look at this again :D -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] InMemory Catalog Implementation [iceberg-python]
kevinjqliu commented on code in PR #289: URL: https://github.com/apache/iceberg-python/pull/289#discussion_r1470604775 ## pyiceberg/catalog/in_memory.py: ## @@ -0,0 +1,222 @@ +import uuid +from typing import ( +Dict, +List, +Optional, +Set, +Union, +) + +from pyiceberg.catalog import ( +Catalog, +Identifier, +Properties, +PropertiesUpdateSummary, +) +from pyiceberg.exceptions import ( +NamespaceAlreadyExistsError, +NamespaceNotEmptyError, +NoSuchNamespaceError, +NoSuchTableError, +TableAlreadyExistsError, +) +from pyiceberg.io import WAREHOUSE +from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec +from pyiceberg.schema import Schema +from pyiceberg.table import ( +CommitTableRequest, +CommitTableResponse, +Table, +update_table_metadata, +) +from pyiceberg.table.metadata import new_table_metadata +from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder +from pyiceberg.typedef import EMPTY_DICT + +DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse" + + +class InMemoryCatalog(Catalog): +"""An in-memory catalog implementation.""" + +__tables: Dict[Identifier, Table] +__namespaces: Dict[Identifier, Properties] + +def __init__(self, name: str, **properties: str) -> None: +super().__init__(name, **properties) +self.__tables = {} +self.__namespaces = {} +self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION + +def create_table( +self, +identifier: Union[str, Identifier], +schema: Schema, +location: Optional[str] = None, +partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, +sort_order: SortOrder = UNSORTED_SORT_ORDER, +properties: Properties = EMPTY_DICT, +table_uuid: Optional[uuid.UUID] = None, +) -> Table: +identifier = Catalog.identifier_to_tuple(identifier) +namespace = Catalog.namespace_from(identifier) + +if identifier in self.__tables: +raise TableAlreadyExistsError(f"Table already exists: {identifier}") +else: +if namespace not in self.__namespaces: +self.__namespaces[namespace] = {} + +if not location: +location = f'{self._warehouse_location}/{"/".join(identifier)}' + +metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json' Review Comment: yep, the actual writing is done by `_commit_table` below, but the path of the metadata location is determined here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
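A small, hedged usage sketch of the InMemoryCatalog proposed in this PR, showing where the metadata path built in the quoted snippet ends up. The warehouse location and table name are illustrative, and the import path simply follows the new module added by the PR.

```python
# Hedged usage sketch of the InMemoryCatalog proposed in this PR, showing where
# the metadata path from the snippet above ends up. The warehouse location and
# table name are illustrative; the import path follows the new module.
from pyiceberg.catalog.in_memory import InMemoryCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField

catalog = InMemoryCatalog("test", warehouse="file:///tmp/warehouse")
table = catalog.create_table(
    ("default", "taxi"),
    schema=Schema(NestedField(field_id=1, name="id", field_type=LongType(), required=True)),
)

# Per the create_table snippet above, the metadata file is written under
# file:///tmp/warehouse/default/taxi/metadata/metadata.json once _commit_table runs.
print(table.metadata_location)
```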
Re: [PR] Spark 3.5: Fix flaky TestSparkExecutorCache [iceberg]
aokolnychyi commented on PR #9583: URL: https://github.com/apache/iceberg/pull/9583#issuecomment-1916101647 Re-triggering one more time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Fix flaky TestSparkExecutorCache [iceberg]
aokolnychyi closed pull request #9583: Spark 3.5: Fix flaky TestSparkExecutorCache URL: https://github.com/apache/iceberg/pull/9583 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Spark 3.5: Fix flaky TestSparkExecutorCache [iceberg]
aokolnychyi opened a new pull request, #9583: URL: https://github.com/apache/iceberg/pull/9583 This PR adds code to destroy all currently live table broadcasts after initialization to fix #9511. Our tests have the following behavior:

```
Append
Append
Execute a row-level operation
```

The cleanup logic for each of the appends may trigger invalidation in delete, potentially making all requests go to the underlying storage and skip the cache. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] InMemory Catalog Implementation [iceberg-python]
kevinjqliu commented on PR #289: URL: https://github.com/apache/iceberg-python/pull/289#issuecomment-1916109508 > Should we also add this catalog to the tests in tests/integration/test_reads.py? @Fokko I had the same idea! Unfortunately, the way integration tests are configured right now requires table information to persist in the catalog from `provision.py` to the pytests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Cannot write to local filesystem [iceberg-python]
kevinjqliu closed issue #299: Cannot write to local filesystem URL: https://github.com/apache/iceberg-python/issues/299 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470623846 ## crates/iceberg/src/writer/file_writer/parquet_writer.rs: ## Review Comment: The file formats supported by Iceberg are limited in number, unlike the range of catalogs. I think it's best to keep them in the same crate for easier use and management of dependencies. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470624504 ## crates/iceberg/src/io.rs: ## @@ -268,6 +268,16 @@ impl OutputFile { .await?) } +/// Delete file. +pub async fn delete(&self) -> Result<()> { Review Comment: I'm thinking of the bad cases: if `delete` returns an error, how can we retry the operation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Add community page [iceberg-python]
Fokko merged PR #315: URL: https://github.com/apache/iceberg-python/pull/315 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470626858 ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Ok(self.op.writer(&self.path[self.relative_path_pos..]).await?) Review Comment: It's better to add a `buffer` for this writer to avoid the `EntityTooSmall` error from S3. --- Another thing we could do is add `concurrent` here so we can upload parts concurrently, but we need to design how users control the concurrency. Maybe we can discuss that later. ## crates/iceberg/src/io.rs: ## @@ -268,6 +268,16 @@ impl OutputFile { .await?) } +/// Delete file. +pub async fn delete(&self) -> Result<()> { +// #TODO +// Do we need to check if file exists? Review Comment: It depends on whether we want to consider it an error if the file doesn't exist. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470626858 ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Ok(self.op.writer(&self.path[self.relative_path_pos..]).await?) Review Comment: It's better to add a `buffer` (for example, 8 MiB) for this writer to avoid the `EntityTooSmall` error from S3. --- Another thing we could do is add `concurrent` here so we can upload parts concurrently, but we need to design how users control the concurrency. Maybe we can discuss that later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470626858 ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Ok(self.op.writer(&self.path[self.relative_path_pos..]).await?) Review Comment: It's better to add a `buffer` (for example, 8 MiB) for this writer to avoid the `EntityTooSmall` error from S3. In the future, we can add a config so users can control it. --- Another thing we could do is add `concurrent` here so we can upload parts concurrently, but we need to design how users control the concurrency. Maybe we can discuss that later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump aiohttp from 3.9.1 to 3.9.2 [iceberg-python]
Fokko merged PR #323: URL: https://github.com/apache/iceberg-python/pull/323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump adlfs from 2023.12.0 to 2024.1.0 [iceberg-python]
Fokko merged PR #320: URL: https://github.com/apache/iceberg-python/pull/320 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump pydantic from 2.5.3 to 2.6.0 [iceberg-python]
Fokko merged PR #317: URL: https://github.com/apache/iceberg-python/pull/317 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org