Re: [PR] Build/Release: Upgrade to RAT 0.16.1 [iceberg]
nastra merged PR #9579: URL: https://github.com/apache/iceberg/pull/9579
Re: [PR] Build: Bump org.assertj:assertj-core from 3.25.1 to 3.25.2 [iceberg]
nastra merged PR #9576: URL: https://github.com/apache/iceberg/pull/9576
Re: [PR] Build: Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.3 to 5.3.1 [iceberg]
nastra merged PR #9572: URL: https://github.com/apache/iceberg/pull/9572
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469223182

## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ##

@@ -123,4 +132,17 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
     collectTempViews(child)
   }
+
+  private object IsIcebergView {

Review Comment: I like those new names better. I've updated the existing one to `ResolvedIdent` as otherwise it clashes with the existing `ResolvedIdentifier`. Unfortunately both `unapply` methods can't be combined
Re: [PR] Small getting started guide on writes [iceberg-python]
Fokko commented on code in PR #311: URL: https://github.com/apache/iceberg-python/pull/311#discussion_r1469257727 ## mkdocs/docs/index.md: ## @@ -38,36 +38,129 @@ You can install the latest release version from pypi: pip install "pyiceberg[s3fs,hive]" ``` -Install it directly for GitHub (not recommended), but sometimes handy: +You can mix and match optional dependencies depending on your needs: + +| Key | Description: | +| | | +| hive | Support for the Hive metastore | +| glue | Support for AWS Glue | +| dynamodb | Support for AWS DynamoDB | +| sql-postgres | Support for SQL Catalog backed by Postgresql | +| sql-sqlite | Support for SQL Catalog backed by SQLite | +| pyarrow | PyArrow as a FileIO implementation to interact with the object store | +| pandas | Installs both PyArrow and Pandas | +| duckdb | Installs both PyArrow and DuckDB | +| ray | Installs PyArrow, Pandas, and Ray | +| s3fs | S3FS as a FileIO implementation to interact with the object store| +| adlfs| ADLFS as a FileIO implementation to interact with the object store | +| snappy | Support for snappy Avro compression | +| gcs | GCS as the FileIO implementation to interact with the object store | + +You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store. + +## Connecting to a catalog + +Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Icebergs' own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Checkout the [configuration](configuration.md) page to find all the configuration details. +## Write a PyArrow dataframe + +Let's take the Taxi dataset, and write this to an Iceberg table. + +First download one month of data: + +```shell +curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet ``` -pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"; + +Load it into your PyArrow dataframe: + +```python +import pyarrow.parquet as pq + +df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet") ``` -Or clone the repository for local development: +Create a new Iceberg table: -```sh -git clone https://github.com/apache/iceberg-python.git -cd iceberg-python -pip3 install -e ".[s3fs,hive]" +```python +from pyiceberg.catalog import load_catalog + +catalog = load_catalog("default") + +table = catalog.create_table( +"default.taxi_dataset", +schema=df.schema, # Blocked by https://github.com/apache/iceberg-python/pull/305 +) ``` -You can mix and match optional dependencies depending on your needs: +Append the dataframe to the table: + +```python +table.append(df) +len(table.scan().to_arrow()) +``` + +3066766 rows have been written to the table. + +Now generate a tip-per-mile feature to train the model on: + +```python +import pyarrow.compute as pc + +df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"])) +``` + +Evolve the schema of the table with the new column: + +```python +from pyiceberg.catalog import Catalog + +with table.update_schema() as update_schema: +# Blocked by https://github.com/apache/iceberg-python/pull/305 +update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema)) Review Comment: I fully agree. I should have been more explicit in the comment above. 
Once #305 is in, we can update the `union_by_name` to also accept `pa.Schema`. WDYT?
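For readers following the thread, a rough sketch of what the simplified call could look like once #305 lands. This is hypothetical usage, assuming `union_by_name` gains support for `pa.Schema`; the catalog name and table identifier are taken from the getting-started example above, and today the private `Catalog._convert_schema_if_needed` helper shown in the guide is still required:

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("default.taxi_dataset")

# A PyArrow schema describing the new column added to the dataframe.
new_columns = pa.schema([pa.field("tip_per_mile", pa.float64())])

with table.update_schema() as update_schema:
    # Proposed: pass the pa.Schema directly and let pyiceberg convert it
    # internally, instead of calling Catalog._convert_schema_if_needed here.
    update_schema.union_by_name(new_columns)
```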
Re: [I] Reading large data through Glue Catalog is SLOW [iceberg]
anechii closed issue #9559: Reading large data through Glue Catalog is SLOW URL: https://github.com/apache/iceberg/issues/9559
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
Fokko commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469288671 ## pyiceberg/io/pyarrow.py: ## @@ -761,6 +761,32 @@ def primitive(self, primitive: pa.DataType) -> T: """Visit a primitive type.""" +class PreOrderPyArrowSchemaVisitor(Generic[T], ABC): Review Comment: ```suggestion class _PreOrderPyArrowSchemaVisitor(Generic[T], ABC): ``` ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None: -self.reserved_ids = {} +self.old_id_to_new_id = {} counter = itertools.count(1) self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter) -def _get_and_increment(self) -> int: -return self.next_id_func() +def _get_and_increment(self, current_id: int) -> int: +new_id = self.next_id_func() +self.old_id_to_new_id[current_id] = new_id +return new_id def schema(self, schema: Schema, struct_result: Callable[[], StructType]) -> Schema: -# First we keep the original identifier_field_ids here, we remap afterwards -fields = struct_result().fields -return Schema(*fields, identifier_field_ids=[self.reserved_ids[field_id] for field_id in schema.identifier_field_ids]) +return Schema( +*struct_result().fields, +identifier_field_ids=[self.old_id_to_new_id[field_id] for field_id in schema.identifier_field_ids], +) def struct(self, struct: StructType, field_results: List[Callable[[], IcebergType]]) -> StructType: -# assign IDs for this struct's fields first -self.reserved_ids.update({field.field_id: self._get_and_increment() for field in struct.fields}) -return StructType(*[field() for field in field_results]) +new_ids = [self._get_and_increment(field.field_id) for field in struct.fields] +new_fields = [] +for field_id, field, field_type in zip(new_ids, struct.fields, field_results): +new_fields.append( +NestedField( +field_id=field_id, +name=field.name, +field_type=field_type(), +required=field.required, +doc=field.doc, +) Review Comment: I think it is cleaner to keep this part in the `field` method, since it creates a field, and the `field()` now doesn't return a field, but a type. ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +932,21 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithNoIds(_ConvertToIceberg): Review Comment: Style suggestion, feel free to ignore: ```suggestion class _ConvertToIcebergWithoutIDs(_ConvertToIceberg): ``` ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: I also noticed that the Python implementation differs from the Java side. I don't know what the historical reason for this is. I couldn't find anything on the original PR about why this was done that way https://github.com/apache/iceberg/pull/5627. I'm okay with aligning this with the Java implementation. > Overall, this approach sounds reasonable to me if we can find an easy way to refactor the _SetFreshIds. @Fokko, I'd appreciate your perspective on refactoring _SetFreshIds. Do you see any issues with this approach? 
That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting `-1` field IDs to the outside. What I like about the current implementation is that the visitor can be used on its own. Converting a field with all `-1` IDs doesn't provide much value on its own. I would love to get this in with the 0.6.0 release to simplify the creation of tables. ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I think at some point we just want to align this with Java. But let's do that in a separate PR: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469223182

## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ##

@@ -123,4 +132,17 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
     collectTempViews(child)
   }
+
+  private object IsIcebergView {

Review Comment: I like those new names better. I've updated the existing one to `ResolvedIdent` as otherwise it clashes with the existing `ResolvedIdentifier`. Unfortunately both `unapply` methods can't be combined into `ResolvedView`
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on PR #9510: URL: https://github.com/apache/iceberg/pull/9510#issuecomment-1914349245 @rdblue I have extracted setting/unsetting view properties into https://github.com/apache/iceberg/pull/9582
Re: [I] gc.enabled property is set to false by default for Apache Iceberg table created in Nessie Catalog [iceberg]
nastra closed issue #9562: gc.enabled property is set to false by default for Apache Iceberg table created in Nessie Catalog URL: https://github.com/apache/iceberg/issues/9562
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on PR #9565: URL: https://github.com/apache/iceberg/pull/9565#issuecomment-1914388103 @ilyasahsan123 we typically apply such changes to the latest version first (1.18) and then backport them in a separate PR. Could you apply these changes to Flink 1.18 first, please?
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
ilyasahsan123 commented on PR #9565: URL: https://github.com/apache/iceberg/pull/9565#issuecomment-1914437348 Sure, I will do it. Thanks
Re: [PR] Spark: Add support for describing/showing views [iceberg]
nastra commented on code in PR #9513: URL: https://github.com/apache/iceberg/pull/9513#discussion_r1469444718 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ExtendedDataSourceV2Strategy.scala: ## @@ -123,6 +132,47 @@ case class ExtendedDataSourceV2Strategy(spark: SparkSession) extends Strategy wi allowExisting = allowExisting, replace = replace) :: Nil +case AlterViewAs(ResolvedV2View(catalog, ident, _), queryText, query) => + CreateV2ViewExec( +catalog = catalog, +ident = ident, +queryText = queryText, +columnAliases = Seq.empty, +columnComments = Seq.empty, +queryColumnNames = query.schema.fieldNames, +viewSchema = query.schema, +comment = None, +properties = Map.empty, +allowExisting = false, +replace = true) :: Nil + +case SetViewProperties(ResolvedV2View(catalog, ident, _), properties) => + val changes = properties.map { +case (property, value) => ViewChange.setProperty(property, value) + }.toSeq + AlterV2ViewExec(catalog, ident, changes) :: Nil + +case UnsetViewProperties(ResolvedV2View(catalog, ident, view), propertyKeys, ifExists) => Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.3: Add RemoveDanglingDeletes action [iceberg]
zinking commented on code in PR #6581: URL: https://github.com/apache/iceberg/pull/6581#discussion_r1469455133

## api/src/main/java/org/apache/iceberg/DeleteFiles.java: ##

@@ -55,6 +55,17 @@ default DeleteFiles deleteFile(DataFile file) {
     return this;
   }
+  /**
+   * Delete a file tracked by a {@link DeleteFile} from the underlying table.
+   *
+   * @param file a DeleteFile to remove from the table
+   * @return this for method chaining
+   */
+  default DeleteFiles deleteFile(DeleteFile file) {
+    deleteFile(file.path());

Review Comment: This will not work and needs to be changed to use the delete-file-specific API. I'm surprised that the unit tests didn't catch it.
Re: [PR] Spark: Add support for describing/showing views [iceberg]
nastra commented on code in PR #9513: URL: https://github.com/apache/iceberg/pull/9513#discussion_r1469477447 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ## @@ -60,17 +65,27 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi properties = properties, allowExisting = allowExisting, replace = replace) + +case a@AlterViewAs(ResolvedV2View(_, ident, _), _, query) => + val q = CTESubstitution.apply(query) + verifyTemporaryObjectsDontExist(ident, q) + a.copy(query = q) + +case u@UnresolvedView(IsIcebergView(resolved), _, _, _) => + loadView(resolved.catalog, resolved.identifier) +.map(v => ResolvedV2View(resolved.catalog.asViewCatalog, resolved.identifier, v)) +.getOrElse(u) + +case ShowViews(UnresolvedNamespace(ns), pattern, output) + if isViewCatalog(spark.sessionState.catalogManager.currentCatalog) => + ShowIcebergViews(UnresolvedNamespace(ns), pattern, output) Review Comment: I've updated this to use `CatalogAndNamespace` for this particular case. For the case where no namespace is given, we need to match `ShowViews(UnresolvedNamespace(Seq()), pattern, output)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Add support for describing/showing views [iceberg]
nastra commented on code in PR #9513: URL: https://github.com/apache/iceberg/pull/9513#discussion_r1469484505 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateV2ViewExec.scala: ## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.util.escapeSingleQuotedString +import org.apache.spark.sql.connector.catalog.View +import org.apache.spark.sql.connector.catalog.ViewCatalog +import org.apache.spark.sql.execution.LeafExecNode +import scala.collection.JavaConverters._ + +case class ShowCreateV2ViewExec(output: Seq[Attribute], view: View) + extends V2CommandExec with LeafExecNode { + + override protected def run(): Seq[InternalRow] = { +val builder = new StringBuilder +builder ++= s"CREATE VIEW ${view.name} " +showColumns(view, builder) +showComment(view, builder) +showProperties(view, builder) +builder ++= s"AS\n${view.query}\n" + +Seq(toCatalystRow(builder.toString)) + } + + private def showColumns(view: View, builder: StringBuilder): Unit = { +val columns = view.schema().fields Review Comment: I ran into this case when I was implementing this so I compared this with how `SHOW CREATE TABLE ` does it. [Here](https://github.com/apache/spark/blob/9088249f02543c549e0b330812c77edcdbbeb795/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCreateTableExec.scala#L64-L67) it can be seen that it just takes the latest schema of the table and uses it for the output. So should we be aligning `SHOW CREATE VIEW` with `SHOW CREATE TABLE` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
ilyasahsan123 commented on PR #9565: URL: https://github.com/apache/iceberg/pull/9565#issuecomment-1914620937 Hi @nastra, I've addressed your suggestion. Could you please take another look? Thanks
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469014546

## tests/catalog/test_base.py: ##

@@ -330,6 +333,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None:
     assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table
+@pytest.mark.parametrize(
+    "schema_fixture,expected_fixture",
+    [
+        ("pyarrow_schema_simple_without_ids", "iceberg_schema_simple"),
+        ("iceberg_schema_simple", "iceberg_schema_simple"),
+        ("iceberg_schema_nested", "iceberg_schema_nested"),
+        ("pyarrow_schema_nested_without_ids", "iceberg_schema_nested"),
+    ],

Review Comment: Thank you for the suggestion! :)
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469628869

## flink/v1.18/build.gradle: ##

@@ -22,6 +22,10 @@ String scalaVersion = System.getProperty("scalaVersion") != null ? System.getPro
 project(":iceberg-flink:iceberg-flink-${flinkMajorVersion}") {
+  test {

Review Comment: this is already set in L127, so it's not needed here
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469634197

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/TestRowDataWrapper.java: ##

@@ -49,12 +49,12 @@ public void testTime() {
       return;
     }
-    Assertions.assertThat(actual).isNotNull();
-    Assertions.assertThat(expected).isNotNull();
+    assertThat(actual).isNotNull();
+    assertThat(expected).isNotNull();
     int expectedMilliseconds = (int) ((long) expected / 1000_000);
     int actualMilliseconds = (int) ((long) actual / 1000_000);
-    Assert.assertEquals(message, expectedMilliseconds, actualMilliseconds);
+    assertThat(expectedMilliseconds).as(message).isEqualTo(actualMilliseconds);

Review Comment: the actual/expected is the wrong way around. It should be `assertThat(actualMilliseconds).isEqualTo(expectedMilliseconds)`
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469635766

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/TestRowDataWrapper.java: ##

@@ -75,8 +75,8 @@ protected void generateAndValidate(Schema schema, RecordWrapperTest.AssertMethod
     StructLikeWrapper actualWrapper = StructLikeWrapper.forType(schema.asStruct());
     StructLikeWrapper expectedWrapper = StructLikeWrapper.forType(schema.asStruct());
     for (int i = 0; i < numRecords; i++) {
-      Assert.assertTrue("Should have more records", actual.hasNext());
-      Assert.assertTrue("Should have more RowData", expected.hasNext());
+      assertThat(actual.hasNext()).as("Should have more records").isTrue();

Review Comment:
```suggestion
      assertThat(actual).as("Should have more records").hasNext();
```
Same for the other places that use a similar pattern. We want to avoid using isTrue/isFalse as much as possible, because we want to get rich context from AssertJ whenever an assertion fails.
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469638343

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/data/TestStructRowData.java: ##

@@ -22,7 +22,7 @@ import org.apache.iceberg.flink.DataGenerator;
 import org.apache.iceberg.flink.DataGenerators;
 import org.apache.iceberg.flink.TestHelpers;
-import org.junit.Test;
+import org.junit.jupiter.api.Test;
 public class TestStructRowData {

Review Comment: why not also include the other files from the `data` package in this PR?
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469640535

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/sink/TestAvroGenericRecordToRowDataMapper.java: ##

@@ -18,10 +18,11 @@ */
 package org.apache.iceberg.flink.sink;
+import static org.junit.jupiter.api.Assertions.assertEquals;

Review Comment: wrong import. We should be using `AssertJ` assertions instead of the JUnit ones
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469642860

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/sink/TestAvroGenericRecordToRowDataMapper.java: ##

@@ -18,10 +18,11 @@ */
 package org.apache.iceberg.flink.sink;

Review Comment: I would have expected other test files from the `sink` package to be included here as well; just picking 2 files seems quite random to me. I'd suggest focusing on one or more packages per PR (depending on how big the changes are) instead of randomly picking files across the codebase.
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469643574 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: That makes sense - I'm having an aha moment here, and just wanted to summarize my thoughts... I think we could include the work to enable passing the baseSchema as an input argument to the visitor to [REPLACE TABLE](https://github.com/apache/iceberg-python/issues/281#issuecomment-1905187291) support since that's the function that requires fetching the original IDs from the baseSchema. CREATE TABLE, which is the only operation we've supported so far doesn't need to compare any IDs against a baseSchema. In the java code, [idFor](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/AssignFreshIds.java#L51) function uses the existing ID from the baseSchema by searching it by name if it exists. If it doesn't exist, it assigns the new ID from the counter. Another difference I see with AssignFreshIds in the java code is that it doesn't handle the reassignment of the identifier_field_ids within the visitor. Instead, it returns the schema object and searches the [new identifier field IDs](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L262) by name. Is there a reason we decided to step away from this approach in the Python implementation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
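The `idFor` behaviour described in that comment can be summarized with a small, self-contained sketch. This is an illustration of the Java semantics under discussion, not current pyiceberg code; `make_id_for`, `next_id_start`, and the sample field names are hypothetical:

```python
import itertools
from typing import Dict, Optional


def make_id_for(base_ids_by_name: Dict[str, int], next_id_start: int):
    """Mirror the described idFor(): reuse the base schema's ID when the full
    field name is known, otherwise hand out the next ID from the counter."""
    counter = itertools.count(next_id_start)

    def id_for(full_name: Optional[str]) -> int:
        if full_name is not None and full_name in base_ids_by_name:
            return base_ids_by_name[full_name]
        return next(counter)

    return id_for


# REPLACE TABLE-style usage: existing names keep their old IDs, new names get fresh ones.
id_for = make_id_for({"id": 1, "payload.value": 3}, next_id_start=4)
print(id_for("id"))             # 1  (reused from the base schema)
print(id_for("payload.other"))  # 4  (fresh, from the counter)

# CREATE TABLE-style usage: no base schema, so every field gets a fresh ID.
fresh = make_id_for({}, next_id_start=1)
print(fresh("id"), fresh("payload"))  # 1 2
```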
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469643916

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/sink/TestRowDataPartitionKey.java: ##

@@ -91,10 +92,10 @@ public void testNullPartitionValue() {
     for (RowData row : rows) {
       PartitionKey partitionKey = new PartitionKey(spec, schema);
       partitionKey.partition(rowWrapper.wrap(row));
-      Assert.assertEquals(partitionKey.size(), 1);
+      assertThat(partitionKey.size()).isEqualTo(1);
       String expectedStr = row.isNullAt(1) ? null : row.getString(1).toString();
-      Assert.assertEquals(expectedStr, partitionKey.get(0, String.class));
+      assertThat(expectedStr).isEqualTo(partitionKey.get(0, String.class));

Review Comment: the ordering of actual/expected needs to be fixed here
Re: [PR] Flink v1.16: Switch Flink v1.16 tests to Junit5 (Part 1) [iceberg]
nastra commented on code in PR #9565: URL: https://github.com/apache/iceberg/pull/9565#discussion_r1469646130

## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/reader/TestArrayBatchRecords.java: ##

@@ -50,19 +51,18 @@ private void testArray(
         fileOffset,
         startingRecordOffset);
-    Assert.assertEquals(splitId, recordsWithSplitIds.nextSplit());
+    assertThat(splitId).isEqualTo(recordsWithSplitIds.nextSplit());
     for (int i = 0; i < numberOfRecords; i++) {
       RecordAndPosition recAndPos = recordsWithSplitIds.nextRecordFromSplit();
-      Assert.assertEquals(elements[i], recAndPos.record());
-      Assert.assertEquals(fileOffset, recAndPos.fileOffset());
-      // recordOffset points to the position after this one
-      Assert.assertEquals(startingRecordOffset + i + 1, recAndPos.recordOffset());
+      assertThat(elements[i]).isEqualTo(recAndPos.record());
+      assertThat(fileOffset).isEqualTo(recAndPos.fileOffset());

Review Comment: I think actual/expected are in the wrong order here as well
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469674294 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None: -self.reserved_ids = {} +self.old_id_to_new_id = {} counter = itertools.count(1) self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter) -def _get_and_increment(self) -> int: -return self.next_id_func() +def _get_and_increment(self, current_id: int) -> int: +new_id = self.next_id_func() +self.old_id_to_new_id[current_id] = new_id +return new_id def schema(self, schema: Schema, struct_result: Callable[[], StructType]) -> Schema: -# First we keep the original identifier_field_ids here, we remap afterwards -fields = struct_result().fields -return Schema(*fields, identifier_field_ids=[self.reserved_ids[field_id] for field_id in schema.identifier_field_ids]) +return Schema( +*struct_result().fields, +identifier_field_ids=[self.old_id_to_new_id[field_id] for field_id in schema.identifier_field_ids], +) def struct(self, struct: StructType, field_results: List[Callable[[], IcebergType]]) -> StructType: -# assign IDs for this struct's fields first -self.reserved_ids.update({field.field_id: self._get_and_increment() for field in struct.fields}) -return StructType(*[field() for field in field_results]) +new_ids = [self._get_and_increment(field.field_id) for field in struct.fields] +new_fields = [] +for field_id, field, field_type in zip(new_ids, struct.fields, field_results): +new_fields.append( +NestedField( +field_id=field_id, +name=field.name, +field_type=field_type(), +required=field.required, +doc=field.doc, +) Review Comment: Yes, I realized that too... I think the refactored behavior actually [mirrors the Java code](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/AssignFreshIds.java#L102) more closely now in how it takes the Type, and then creates the NestedField with the result of the field level Callable within the struct function call in order to assign the IDs first to the fields, and then to its children. I think preserving this approach allows us to avoid introducing the new constraint of requiring the field IDs to be unique, and limits that restriction to just the [correctness in the task of refreshing the identifier_field_ids](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L262), which requires referring to the field Name by index as well as by name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469684751 ## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckViews.scala: ## @@ -36,6 +38,9 @@ object CheckViews extends (LogicalPlan => Unit) { verifyColumnCount(ident, columnAliases, query) SchemaUtils.checkColumnNameDuplication(query.schema.fieldNames, SQLConf.get.resolver) +case AlterViewAs(ResolvedV2View(_, _, _), _, query) => + SchemaUtils.checkColumnNameDuplication(query.schema.fieldNames, SQLConf.get.resolver) Review Comment: I agree that this is a weird case and I also had a different expectation when initially running into this. That's why I added the `alterViewWithUpdatedQueryColumns()` test to make this behavior explicit. I also looking into how V1 views behave here and they do the same thing (aka losing the column comments from the original CREATE): ``` spark-sql (default)> create view iceberg1.v10 (x COMMENT 'comment 1', y COMMENT 'comment 2') AS SELECT count(id), zip from iceberg1.foo group by zip; spark-sql (default)> show create table iceberg1.v10; CREATE VIEW iceberg1.v10 ( x COMMENT 'comment 1', y COMMENT 'comment 2') TBLPROPERTIES ( 'transient_lastDdlTime' = '1706538575') AS SELECT count(id) AS cnt, zip from iceberg1.foo group by zip Time taken: 0.055 seconds, Fetched 1 row(s) spark-sql (default)> describe extended iceberg1.v10; xbigint comment 1 yint comment 2 # Detailed Table Information Catalog spark_catalog Database iceberg1 Tablev10 Ownernastra Created Time Mon Jan 29 15:29:35 CET 2024 Last Access UNKNOWN Created By Spark 3.5.0 Type VIEW View TextSELECT count(id) AS cnt, zip from iceberg1.foo group by zip View Original Text SELECT count(id) AS cnt, zip from iceberg1.foo group by zip View Catalog and Namespace spark_catalog.default View Query Output Columns[cnt, zip] Table Properties [transient_lastDdlTime=1706538575] Serde Libraryorg.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [serialization.format=1] Time taken: 0.063 seconds, Fetched 21 row(s) ``` ``` spark-sql (default)> alter view iceberg1.v10 AS SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip; Time taken: 0.213 seconds spark-sql (default)> show create table iceberg1.v10; CREATE VIEW iceberg1.v10 ( cnt, zip) TBLPROPERTIES ( 'transient_lastDdlTime' = '1706538634') AS SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip Time taken: 0.029 seconds, Fetched 1 row(s) spark-sql (default)> describe extended iceberg1.v10; cnt bigint zip int # Detailed Table Information Catalog spark_catalog Database iceberg1 Tablev10 Ownernastra Created Time Mon Jan 29 15:29:35 CET 2024 Last Access UNKNOWN Created By Spark 3.5.0 Type VIEW View TextSELECT count(zip) AS cnt, zip from iceberg1.foo group by zip View Original Text SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip View Catalog and Namespace spark_catalog.default View Query Output Columns[cnt, zip]
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on PR #305: URL: https://github.com/apache/iceberg-python/pull/305#issuecomment-1914837022 Hi @Fokko - for some reason I don't see an option to respond to this specific Review comment, so I'm just replying here instead: > I would love to get this in with the 0.6.0 release to simplify the creation of tables. Sounds good 👍 I'll try my best to bring this review to a consensus in the next few days. We already have a working version, so we just need to agree on the final details 😄 >That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting -1 field IDs to the outside. > > What I like about the current implementation is that the visitor can be used on its own. Converting a field with all -1 IDs doesn't provide much value on its own. I'm 🆗 with both approach, but I have a preference for the '-1' approach... and here's my current thought. Both approaches will work, and in both approaches, we are introducing an intermediate state of the Iceberg Schema that should never be exposed. But what I like about this approach is that we are creating an intermediate Iceberg Schema that is very visibly wrong, that we could use to check the validity of the schema before we allow the erroneous intermediate state to materialize a lasting result. An example could be to introduce a new sanity check before writing the individual files, or before making a schema update, that checks that no two IDs are the same, or that the IDs are greater than 0. Hence, I think the fact that the intermediate Iceberg Schema is "seemingly" correct actually feels more like a downside to me... I'm curious to hear if that argument resonates with you or the rest of the group. Hopefully it will be the last discussion item before we settle on the final approach 🎉🎈 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
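The sanity check suggested in that comment could be as simple as the following sketch, run before writing files or applying a schema update. It is a hypothetical helper, not an existing pyiceberg API, and the field names used in the example are made up:

```python
from typing import Iterable, Tuple


def check_field_ids(fields: Iterable[Tuple[str, int]]) -> None:
    """Reject schemas that still carry placeholder (-1) or duplicate field IDs."""
    seen = {}
    for name, field_id in fields:
        if field_id < 1:
            raise ValueError(f"Field {name!r} still has a placeholder ID: {field_id}")
        if field_id in seen:
            raise ValueError(f"Fields {seen[field_id]!r} and {name!r} share field ID {field_id}")
        seen[field_id] = name


check_field_ids([("vendor_id", 1), ("tip_amount", 2)])      # passes silently
# check_field_ids([("vendor_id", -1), ("tip_amount", -1)])  # would raise on the placeholder ID
```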
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469706761 ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +932,21 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithNoIds(_ConvertToIceberg): Review Comment: Thank you for the suggestion :)
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469710203 ## pyiceberg/io/pyarrow.py: ## @@ -906,6 +986,76 @@ def after_map_value(self, element: pa.Field) -> None: self._field_names.pop() +class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]): Review Comment: > I would love to get this in with the 0.6.0 release to simplify the creation of tables. Sounds good 👍 I'll try my best to bring this review to a consensus in the next few days. We already have a working version, so we just need to agree on the final details 😄 >That's an okay approach, as long as the visitor to do this is hidden inside the package. We should not expose setting -1 field IDs to the outside. > > What I like about the current implementation is that the visitor can be used on its own. Converting a field with all -1 IDs doesn't provide much value on its own. I'm 🆗 with both approach, but I have a preference for the '-1' approach... and here's my current thought. Both approaches will work, and in both approaches, we are introducing an intermediate state of the Iceberg Schema that should never be exposed. But what I like about this approach is that we are creating an intermediate Iceberg Schema that is very visibly wrong, that we could use to check the validity of the schema before we allow the erroneous intermediate state to materialize a lasting result. An example could be to introduce a new sanity check before writing the individual files, or before making a schema update, that checks that no two IDs are the same, or that the IDs are greater than 0. Hence, I think the fact that the intermediate Iceberg Schema is "seemingly" correct actually feels more like a downside to me... I'm curious to hear if that argument resonates with you or the rest of the group. Hopefully it will be the last discussion item before we settle on the final approach 🎉🎈 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1469711334 ## pyiceberg/io/pyarrow.py: ## @@ -761,6 +761,32 @@ def primitive(self, primitive: pa.DataType) -> T: """Visit a primitive type.""" +class PreOrderPyArrowSchemaVisitor(Generic[T], ABC): Review Comment: Good catch - I missed this one. I'll remove this visitor as it's no longer in use.
Re: [PR] Spark: Support altering views [iceberg]
nastra commented on code in PR #9510: URL: https://github.com/apache/iceberg/pull/9510#discussion_r1469766873

## spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteViewCommands.scala: ##

@@ -60,17 +63,23 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
       properties = properties,
       allowExisting = allowExisting,
       replace = replace)
+
+    case a@AlterViewAs(ResolvedV2View(_, ident, _), _, query) =>

Review Comment: if we want to go this route, then we'd also need to add the same handling for `SetViewProperties(UnresolvedView(...))` / `UnsetViewProperties(UnresolvedView(...))`, hence it seemed better to me to just handle `case u@UnresolvedView(ResolvedView(resolved), _, _, _)` separately. I'm fine either way, so let me know if you'd still like me to switch to the approach you proposed.
[PR] docs: Add community page [iceberg-python]
Fokko opened a new pull request, #315: URL: https://github.com/apache/iceberg-python/pull/315 Analogous to the general Iceberg page. I also think it is good to revive the PyIceberg sync.
[PR] docs: Remove feature support page [iceberg-python]
Fokko opened a new pull request, #316: URL: https://github.com/apache/iceberg-python/pull/316 With the 0.6.0 release we'll check the last box, and we have feature parity with Java according to this table. I would prefer creating tickets on Github to point out the gaps rather than this page.
Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]
gjacoby126 commented on PR #9452: URL: https://github.com/apache/iceberg/pull/9452#issuecomment-1915203994 @nastra @pvary, just checking back in. I was wondering if there were any more changes you'd like to see to this patch, or if it's ready for commit?
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1469960212 ## api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java: ## @@ -564,14 +569,26 @@ private static String sanitizeDate(int days, int today) { return "(date)"; } - private static String sanitizeTimestamp(long micros, long now) { -String isPast = now > micros ? "ago" : "from-now"; -long diff = Math.abs(now - micros); -if (diff < FIVE_MINUTES_IN_MICROS) { + private static String sanitizeTimestamp(ChronoUnit unit, long timeUnits, long nowMillis) { +long timeMillis; +switch (unit) { + case MICROS: +timeMillis = DateTimeUtil.microsToMillis(timeUnits); +break; + case NANOS: +timeMillis = DateTimeUtil.nanosToMillis(timeUnits); Review Comment: I don't see how. See line 521 of `sanitize` for example. ## api/src/main/java/org/apache/iceberg/expressions/Literals.java: ## @@ -501,18 +531,22 @@ public Literal to(Type type) { return (Literal) new TimeLiteral(timeMicros); case TIMESTAMP: - if (((Types.TimestampType) type).shouldAdjustToUTC()) { -long timestampMicros = -ChronoUnit.MICROS.between( -EPOCH, OffsetDateTime.parse(value(), DateTimeFormatter.ISO_DATE_TIME)); -return (Literal) new TimestampLiteral(timestampMicros); + TimestampType tsType = (TimestampType) type; + if (tsType.shouldAdjustToUTC()) { +long timestampUnits = +tsType +.unit() +.between(EPOCH, OffsetDateTime.parse(value(), DateTimeFormatter.ISO_DATE_TIME)); Review Comment: Maybe. Do you want that done here, or in a follow-up pull request? ## api/src/main/java/org/apache/iceberg/expressions/Literals.java: ## @@ -426,16 +428,40 @@ protected Type.TypeID typeId() { } static class TimestampLiteral extends ComparableLiteral { -TimestampLiteral(Long value) { +private final ChronoUnit unit; + +TimestampLiteral(ChronoUnit unit, Long value) { super(value); + this.unit = unit; } @Override @SuppressWarnings("unchecked") public Literal to(Type type) { switch (type.typeId()) { case TIMESTAMP: - return (Literal) this; + ChronoUnit toUnit = ((TimestampType) type).unit(); + switch (unit) { +case MICROS: + switch (toUnit) { +case MICROS: + return (Literal) this; +case NANOS: + return (Literal) + new TimestampLiteral(unit, DateTimeUtil.microsToNanos(value())); + } + break; +case NANOS: + switch (toUnit) { +case MICROS: + return (Literal) + new TimestampLiteral(unit, DateTimeUtil.nanosToMicros(value())); +case NANOS: + return (Literal) this; + } + break; + } + break; Review Comment: I'll add that. ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -46,8 +47,10 @@ private Types() {} .put(DoubleType.get().toString(), DoubleType.get()) .put(DateType.get().toString(), DateType.get()) .put(TimeType.get().toString(), TimeType.get()) - .put(TimestampType.withZone().toString(), TimestampType.withZone()) - .put(TimestampType.withoutZone().toString(), TimestampType.withoutZone()) + .put(TimestampType.microsWithZone().toString(), TimestampType.microsWithZone()) + .put(TimestampType.microsWithoutZone().toString(), TimestampType.microsWithoutZone()) Review Comment: Are you just suggesting we leave lines 49 and 50 alone, i.e. using the methods we want to deprecate? Functionally there's no difference, but I'd prefer not to use the deprecated forms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1469960791 ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -205,38 +208,74 @@ public String toString() { } public static class TimestampType extends PrimitiveType { -private static final TimestampType INSTANCE_WITH_ZONE = new TimestampType(true); -private static final TimestampType INSTANCE_WITHOUT_ZONE = new TimestampType(false); +private static final TimestampType INSTANCE_MICROS_WITH_ZONE = +new TimestampType(true, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_MICROS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_NANOS_WITH_ZONE = +new TimestampType(true, ChronoUnit.NANOS); +private static final TimestampType INSTANCE_NANOS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.NANOS); + +/** @deprecated use {@link #microsWithZone()} instead. */ +@Deprecated Review Comment: Happy to add that. Are we certain it will be in "2.0" or should I say something like "likely removed in 2.0"? ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -205,38 +208,74 @@ public String toString() { } public static class TimestampType extends PrimitiveType { -private static final TimestampType INSTANCE_WITH_ZONE = new TimestampType(true); -private static final TimestampType INSTANCE_WITHOUT_ZONE = new TimestampType(false); +private static final TimestampType INSTANCE_MICROS_WITH_ZONE = +new TimestampType(true, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_MICROS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_NANOS_WITH_ZONE = +new TimestampType(true, ChronoUnit.NANOS); +private static final TimestampType INSTANCE_NANOS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.NANOS); + +/** @deprecated use {@link #microsWithZone()} instead. */ +@Deprecated public static TimestampType withZone() { - return INSTANCE_WITH_ZONE; + return INSTANCE_MICROS_WITH_ZONE; } +/** @deprecated use {@link #microsWithoutZone()} instead. */ +@Deprecated public static TimestampType withoutZone() { - return INSTANCE_WITHOUT_ZONE; + return INSTANCE_MICROS_WITHOUT_ZONE; +} + +public static TimestampType microsWithZone() { + return INSTANCE_MICROS_WITH_ZONE; +} + +public static TimestampType microsWithoutZone() { + return INSTANCE_MICROS_WITHOUT_ZONE; +} + +public static TimestampType nanosWithZone() { + return INSTANCE_NANOS_WITH_ZONE; +} + +public static TimestampType nanosWithoutZone() { + return INSTANCE_NANOS_WITHOUT_ZONE; } private final boolean adjustToUTC; +private final ChronoUnit unit; -private TimestampType(boolean adjustToUTC) { +private TimestampType(boolean adjustToUTC, ChronoUnit unit) { this.adjustToUTC = adjustToUTC; + this.unit = unit; } public boolean shouldAdjustToUTC() { return adjustToUTC; } +public ChronoUnit unit() { Review Comment: We use some ChronoUnit functionality, e.g. the `between` method. I can introduce our own enum wrapping a ChonoUnit and delegating that method, if you prefer. 
## api/src/test/java/org/apache/iceberg/TestPartitionPaths.java: ## @@ -42,8 +42,8 @@ public void testPartitionPath() { Transform bucket = Transforms.bucket(10); Literal ts = - Literal.of("2017-12-01T10:12:55.038194").to(Types.TimestampType.withoutZone()); -Object tsHour = hour.bind(Types.TimestampType.withoutZone()).apply(ts.value()); + Literal.of("2017-12-01T10:12:55.038194").to(Types.TimestampType.microsWithoutZone()); +Object tsHour = hour.bind(Types.TimestampType.microsWithoutZone()).apply(ts.value()); Review Comment: Will do. ## api/src/test/java/org/apache/iceberg/expressions/TestStringLiteralConversions.java: ## @@ -101,7 +101,7 @@ public void testStringToTimestampLiteral() { // Timestamp with explicit UTC offset, +00:00 Literal timestampStr = Literal.of("2017-08-18T14:21:01.919+00:00"); -Literal timestamp = timestampStr.to(Types.TimestampType.withZone()); +Literal timestamp = timestampStr.to(Types.TimestampType.microsWithZone()); Review Comment: Will do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.or
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1469961074 ## api/src/test/java/org/apache/iceberg/types/TestConversions.java: ## @@ -93,15 +93,26 @@ public void testByteBufferConversions() { assertThat(Literal.of(1L).to(TimeType.get()).toByteBuffer().array()) .isEqualTo(new byte[] {16, 39, 0, 0, 0, 0, 0, 0}); -// timestamps are stored as microseconds from 1970-01-01 00:00:00.00 in an 8-byte +// timestamps are stored as micro|nanoseconds from 1970-01-01 00:00:00 in an +// 8-byte Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add View Support to Trino [iceberg]
maxpoulain commented on issue #7939: URL: https://github.com/apache/iceberg/issues/7939#issuecomment-1915233640 Hi! Is there any news on this subject? Has this issue been moved to the Trino project? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.4: Fix writing of default values in CoW for rows with NULL columns which are unmatched [iceberg]
amogh-jahagirdar commented on PR #9556: URL: https://github.com/apache/iceberg/pull/9556#issuecomment-1915239004 > Just to be sure, can you set spark.sql.planChangeLog.level to info and execute the failing test to see what rule adds the default value behavior? It is still not clear how a wrong nullability info in MergeRows led to inserting default values. Sure, I've set that for before/after the fix (specifically the fix @rdblue mentioned here https://github.com/apache/iceberg/pull/9556#discussion_r1468932165). Logs Before: https://gist.github.com/amogh-jahagirdar/98ffeaa21e203423c7da0c98e132e9bd Logs After: https://gist.github.com/amogh-jahagirdar/6209e373b37e2d7a134df447ec49f8b5 I don't see anything differentiating in terms of the rules and their outputs. I can still explain why there's default value behavior, though, based on what I saw in the debugger. The Iceberg schema used for performing the write gets built here in SparkWriteBuilder#validateOrMergeWriteSchema: https://github.com/apache/iceberg/blob/54756b6f5c653be2ab271bfbe262e77109bf9608/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java#L120 That schema marks the fields *required* instead of optional, because the nullability passed through the Spark Writer API is incorrect. Then, when the write is actually performed, we end up writing default values. I've updated this PR to follow the approach @rdblue suggested in https://github.com/apache/iceberg/pull/9556#discussion_r1468932165, which after going through the thread I believe does make sense, and is a better solution compared to the previous approach, which just forces the matched output list to have the attributes so that the nullability check behaves as expected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.4: Fix writing of default values in CoW for rows with NULL columns which are unmatched [iceberg]
amogh-jahagirdar commented on code in PR #9556: URL: https://github.com/apache/iceberg/pull/9556#discussion_r1469971672 ## spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala: ## @@ -214,6 +214,8 @@ object RewriteMergeIntoTable extends RewriteRowLevelIcebergCommand with Predicat val rowFromSourceAttr = resolveAttrRef(ROW_FROM_SOURCE_REF, joinPlan) val rowFromTargetAttr = resolveAttrRef(ROW_FROM_TARGET_REF, joinPlan) +// The output expression should retain read attributes for correctly determining nullability +val matchedOutputsWithAttrs = matchedActions.map(matchedActionOutput(_, metadataAttrs) :+ readAttrs) Review Comment: Thanks @rdblue, I appreciate the detailed thread with the reasoning; I followed along and it makes sense. I agree that adding the readAttrs to the notMatchedOutput makes more sense for having a `MergeRows` logical plan that is logically sound, compared to the previous approach, which more or less forces the nullability check to pass just by passing the attributes along with the matched output list. I've updated the PR. @aokolnychyi let us know what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add View Support to Trino [iceberg]
amogh-jahagirdar closed issue #7939: Add View Support to Trino URL: https://github.com/apache/iceberg/issues/7939 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add View Support to Trino [iceberg]
amogh-jahagirdar commented on issue #7939: URL: https://github.com/apache/iceberg/issues/7939#issuecomment-1915260354 Hey @maxpoulain, yes, the tracking issue is in Trino: https://github.com/trinodb/trino/issues/14120#issuecomment-1803568597 If you're interested, I also have a PR open there for view support in Trino for the REST Catalog: https://github.com/trinodb/trino/pull/19818 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Add community page [iceberg-python]
syun64 commented on code in PR #315: URL: https://github.com/apache/iceberg-python/pull/315#discussion_r1470043007 ## mkdocs/docs/community.md: ## @@ -0,0 +1,64 @@ +--- +hide: + - navigation +--- + + + +# Join the community + +Apache Iceberg tracks issues in GitHub and prefers to receive contributions as pull requests. + +Community discussions happen primarily on the [dev mailing list](https://lists.apache.org/list.html?d...@iceberg.apache.org), on [Apache Iceberg Slack workspace](https://join.slack.com/t/apache-iceberg/shared_invite/zt-287g3akar-K9Oe_En5j1UL7Y_Ikpai3A) in the #python channel, and on specific [GitHub issues](https://github.com/apache/iceberg-python/issues). + +## Iceberg Community Events + +The PyIceberg community sync is on each last Tuesday each month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync). Review Comment: super nit (please feel free to take it or leave it): I'm wondering if this would read better ```suggestion The PyIceberg community sync is on the last Tuesday of every month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync). ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Add community page [iceberg-python]
Fokko commented on code in PR #315: URL: https://github.com/apache/iceberg-python/pull/315#discussion_r1470046433 ## mkdocs/docs/community.md: ## @@ -0,0 +1,64 @@ +--- +hide: + - navigation +--- + + + +# Join the community + +Apache Iceberg tracks issues in GitHub and prefers to receive contributions as pull requests. + +Community discussions happen primarily on the [dev mailing list](https://lists.apache.org/list.html?d...@iceberg.apache.org), on [Apache Iceberg Slack workspace](https://join.slack.com/t/apache-iceberg/shared_invite/zt-287g3akar-K9Oe_En5j1UL7Y_Ikpai3A) in the #python channel, and on specific [GitHub issues](https://github.com/apache/iceberg-python/issues). + +## Iceberg Community Events + +The PyIceberg community sync is on each last Tuesday each month. To join, make sure to subscribe to the [iceberg-python-sync Google group](https://groups.google.com/g/iceberg-python-sync). Review Comment: Thanks, good suggestion :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Small getting started guide on writes [iceberg-python]
HonahX commented on code in PR #311: URL: https://github.com/apache/iceberg-python/pull/311#discussion_r1470060936 ## mkdocs/docs/index.md: ## @@ -38,36 +38,129 @@ You can install the latest release version from pypi: pip install "pyiceberg[s3fs,hive]" ``` -Install it directly for GitHub (not recommended), but sometimes handy: +You can mix and match optional dependencies depending on your needs: + +| Key | Description: | +| | | +| hive | Support for the Hive metastore | +| glue | Support for AWS Glue | +| dynamodb | Support for AWS DynamoDB | +| sql-postgres | Support for SQL Catalog backed by Postgresql | +| sql-sqlite | Support for SQL Catalog backed by SQLite | +| pyarrow | PyArrow as a FileIO implementation to interact with the object store | +| pandas | Installs both PyArrow and Pandas | +| duckdb | Installs both PyArrow and DuckDB | +| ray | Installs PyArrow, Pandas, and Ray | +| s3fs | S3FS as a FileIO implementation to interact with the object store| +| adlfs| ADLFS as a FileIO implementation to interact with the object store | +| snappy | Support for snappy Avro compression | +| gcs | GCS as the FileIO implementation to interact with the object store | + +You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store. + +## Connecting to a catalog + +Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Icebergs' own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Checkout the [configuration](configuration.md) page to find all the configuration details. +## Write a PyArrow dataframe + +Let's take the Taxi dataset, and write this to an Iceberg table. + +First download one month of data: + +```shell +curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet ``` -pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"; + +Load it into your PyArrow dataframe: + +```python +import pyarrow.parquet as pq + +df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet") ``` -Or clone the repository for local development: +Create a new Iceberg table: -```sh -git clone https://github.com/apache/iceberg-python.git -cd iceberg-python -pip3 install -e ".[s3fs,hive]" +```python +from pyiceberg.catalog import load_catalog + +catalog = load_catalog("default") + +table = catalog.create_table( +"default.taxi_dataset", +schema=df.schema, # Blocked by https://github.com/apache/iceberg-python/pull/305 +) ``` -You can mix and match optional dependencies depending on your needs: +Append the dataframe to the table: + +```python +table.append(df) +len(table.scan().to_arrow()) +``` + +3066766 rows have been written to the table. + +Now generate a tip-per-mile feature to train the model on: + +```python +import pyarrow.compute as pc + +df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"])) +``` + +Evolve the schema of the table with the new column: + +```python +from pyiceberg.catalog import Catalog + +with table.update_schema() as update_schema: +# Blocked by https://github.com/apache/iceberg-python/pull/305 +update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema)) Review Comment: Thanks for the context! Yes, I think it's a good idea. 
Let's do it after #305 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix writing to local filesystem [iceberg-python]
Fokko merged PR #301: URL: https://github.com/apache/iceberg-python/pull/301 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Remove feature support page [iceberg-python]
Fokko merged PR #316: URL: https://github.com/apache/iceberg-python/pull/316 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Remove feature support page [iceberg-python]
Fokko commented on PR #316: URL: https://github.com/apache/iceberg-python/pull/316#issuecomment-1915379284 Thanks everyone for the quick review 🙌 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
Fokko commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470106203 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: > CREATE TABLE, which is the only operation we've supported so far doesn't need to compare any IDs against a baseSchema. In the java code, idFor function uses the existing ID from the baseSchema by searching it by name if it exists. If it doesn't exist, it assigns the new ID from the counter. Exactly, we could just pass in an empty schema (to avoid all the null-checks). > Instead, it returns the schema object and searches the new identifier field IDs by name. Is there a reason we decided to step away from this approach in the Python implementation? I don't think there was an explicit choice made there. The lookup by column name is correct, we do the same when evolving the schema: https://github.com/apache/iceberg-python/blob/a3e36838b05e4ac434ab394af758054b12d6/pyiceberg/table/__init__.py#L1404 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
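To make the name-based reuse idea above concrete, here is a minimal, self-contained sketch (hypothetical helper name `assign_fresh_ids`, not pyiceberg's actual visitor API): existing IDs are looked up by column name in a base mapping, and anything not found gets the next value from a counter. For CREATE TABLE the base mapping is simply empty, so every field gets a fresh ID.

```python
import itertools
from typing import Dict, List, Tuple


def assign_fresh_ids(
    fields: List[Tuple[str, int]],  # (column name, old field ID) pairs from the incoming schema
    base_ids: Dict[str, int],       # name -> field ID in the base schema; {} for CREATE TABLE
    next_id: int = 1,
) -> Dict[int, int]:
    """Map old field IDs to new ones, reusing base-schema IDs by name when available."""
    counter = itertools.count(next_id)
    old_to_new: Dict[int, int] = {}
    for name, old_id in fields:
        if name in base_ids:
            # The column already exists in the base schema: keep its ID stable.
            old_to_new[old_id] = base_ids[name]
        else:
            # New column: assign the next monotonically increasing ID.
            old_to_new[old_id] = next(counter)
    return old_to_new


# CREATE TABLE: no base schema, so IDs are simply 1, 2, ...
print(assign_fresh_ids([("id", 10), ("name", 11)], base_ids={}))  # {10: 1, 11: 2}

# Evolution-style reuse: "id" keeps its existing ID, "ts" gets a fresh one.
print(assign_fresh_ids([("id", 10), ("ts", 12)], base_ids={"id": 1, "name": 2}, next_id=3))  # {10: 1, 12: 3}
```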
Re: [PR] Spark 3.4: Support creating views via SQL [iceberg]
amogh-jahagirdar merged PR #9580: URL: https://github.com/apache/iceberg/pull/9580 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470130736 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: Ah interesting! Does this mean we could remove this from _SetFreshIDs, and instead handle the identifier_field_names update in the corresponding operation that requires updating identifier_field_names to the new IDs? That would simplify this visitor even more 😄 Right now, only create_table uses this visitor, and it doesn't require identifier_field_names to be updated (it'll always be an empty list for a new table). If [UpdateSchema](https://github.com/apache/iceberg-python/blob/a3e36838b05e4ac434ab394af758054b12d6/pyiceberg/table/__init__.py#L1437) is already doing this within the corresponding class that wraps the transaction, I think we could do the same in other operations that need it (like ReplaceTable) instead of keeping this feature within _SetFreshIDs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
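For illustration, a small sketch of what "handling it in the operation" could look like (hypothetical helper, not the actual pyiceberg code): once fresh IDs have been assigned, the operation that cares about identifier fields re-resolves them by name against the new schema, so the visitor itself no longer needs to track them.

```python
from typing import Dict, List


def resolve_identifier_field_ids(
    identifier_field_names: List[str],
    name_to_new_id: Dict[str, int],
) -> List[int]:
    """Re-resolve identifier fields by name against freshly assigned field IDs."""
    missing = [name for name in identifier_field_names if name not in name_to_new_id]
    if missing:
        raise ValueError(f"Cannot find identifier field(s) in the new schema: {missing}")
    return [name_to_new_id[name] for name in identifier_field_names]


# create_table: a brand-new table has no identifier fields yet, so this is just an empty list.
assert resolve_identifier_field_ids([], {"id": 1, "name": 2}) == []

# A replace-table style operation can resolve its identifier fields explicitly after ID assignment.
assert resolve_identifier_field_ids(["id"], {"id": 1, "name": 2}) == [1]
```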
Re: [PR] InMemory Catalog Implementation [iceberg-python]
Fokko commented on code in PR #289: URL: https://github.com/apache/iceberg-python/pull/289#discussion_r1469989182 ## pyiceberg/catalog/__init__.py: ## @@ -137,12 +138,19 @@ def load_sql(name: str, conf: Properties) -> Catalog: raise NotInstalledError("SQLAlchemy support not installed: pip install 'pyiceberg[sql-postgres]'") from exc +def load_memory(name: str, conf: Properties) -> Catalog: +from pyiceberg.catalog.in_memory import InMemoryCatalog + +return InMemoryCatalog(name, **conf) + + AVAILABLE_CATALOGS: dict[CatalogType, Callable[[str, Properties], Catalog]] = { CatalogType.REST: load_rest, CatalogType.HIVE: load_hive, CatalogType.GLUE: load_glue, CatalogType.DYNAMODB: load_dynamodb, CatalogType.SQL: load_sql, +CatalogType.MEMORY: load_memory, Review Comment: Can you also add this one to the docs: https://py.iceberg.apache.org/configuration/ With a warning that this is just for testing purposes only. ## pyiceberg/catalog/in_memory.py: ## @@ -0,0 +1,222 @@ +import uuid +from typing import ( +Dict, +List, +Optional, +Set, +Union, +) + +from pyiceberg.catalog import ( +Catalog, +Identifier, +Properties, +PropertiesUpdateSummary, +) +from pyiceberg.exceptions import ( +NamespaceAlreadyExistsError, +NamespaceNotEmptyError, +NoSuchNamespaceError, +NoSuchTableError, +TableAlreadyExistsError, +) +from pyiceberg.io import WAREHOUSE +from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec +from pyiceberg.schema import Schema +from pyiceberg.table import ( +CommitTableRequest, +CommitTableResponse, +Table, +update_table_metadata, +) +from pyiceberg.table.metadata import new_table_metadata +from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder +from pyiceberg.typedef import EMPTY_DICT + +DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse" + + +class InMemoryCatalog(Catalog): +"""An in-memory catalog implementation.""" + +__tables: Dict[Identifier, Table] +__namespaces: Dict[Identifier, Properties] + +def __init__(self, name: str, **properties: str) -> None: +super().__init__(name, **properties) +self.__tables = {} +self.__namespaces = {} +self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION + +def create_table( +self, +identifier: Union[str, Identifier], +schema: Schema, +location: Optional[str] = None, +partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, +sort_order: SortOrder = UNSORTED_SORT_ORDER, +properties: Properties = EMPTY_DICT, +table_uuid: Optional[uuid.UUID] = None, +) -> Table: +identifier = Catalog.identifier_to_tuple(identifier) +namespace = Catalog.namespace_from(identifier) + +if identifier in self.__tables: +raise TableAlreadyExistsError(f"Table already exists: {identifier}") +else: +if namespace not in self.__namespaces: +self.__namespaces[namespace] = {} + +if not location: +location = f'{self._warehouse_location}/{"/".join(identifier)}' + +metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json' Review Comment: It looks like we don't write the metadata here, but we write it below at the `_commit` method ## pyiceberg/catalog/in_memory.py: ## @@ -0,0 +1,222 @@ +import uuid +from typing import ( +Dict, +List, +Optional, +Set, +Union, +) + +from pyiceberg.catalog import ( +Catalog, +Identifier, +Properties, +PropertiesUpdateSummary, +) +from pyiceberg.exceptions import ( +NamespaceAlreadyExistsError, +NamespaceNotEmptyError, +NoSuchNamespaceError, +NoSuchTableError, +TableAlreadyExistsError, +) +from pyiceberg.io import WAREHOUSE +from pyiceberg.partitioning 
import UNPARTITIONED_PARTITION_SPEC, PartitionSpec +from pyiceberg.schema import Schema +from pyiceberg.table import ( +CommitTableRequest, +CommitTableResponse, +Table, +update_table_metadata, +) +from pyiceberg.table.metadata import new_table_metadata +from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder +from pyiceberg.typedef import EMPTY_DICT + +DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse" + + +class InMemoryCatalog(Catalog): +"""An in-memory catalog implementation.""" + +__tables: Dict[Identifier, Table] +__namespaces: Dict[Identifier, Properties] + +def __init__(self, name: str, **properties: str) -> None: +super().__init__(name, **properties) +self.__tables = {} +self.__namespaces = {} +self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION + +def create_table( +self, +identifier: Union[str, Identifier
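As a rough sketch of the review point about the metadata file above (using plain json and os calls instead of pyiceberg's FileIO, and a hypothetical metadata_dict as a stand-in for a real TableMetadata object), create_table could eagerly write the first metadata file so that metadata_location points at something that actually exists before the first _commit:

```python
import json
import os
from typing import Dict, Tuple


def write_initial_metadata(
    warehouse_location: str,
    identifier: Tuple[str, ...],
    metadata_dict: Dict,
) -> str:
    """Write the first metadata JSON for a newly created table and return its location."""
    table_dir = os.path.join(warehouse_location, *identifier)
    metadata_location = os.path.join(table_dir, "metadata", "metadata.json")
    os.makedirs(os.path.dirname(metadata_location), exist_ok=True)
    with open(metadata_location, "w", encoding="utf-8") as out:
        json.dump(metadata_dict, out)
    return metadata_location


# Stand-in payload; a real catalog would serialize TableMetadata through its FileIO instead.
print(write_initial_metadata("/tmp/warehouse", ("default", "taxi_dataset"), {"format-version": 2}))
```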
[PR] Build: Bump pydantic from 2.5.3 to 2.6.0 [iceberg-python]
dependabot[bot] opened a new pull request, #317: URL: https://github.com/apache/iceberg-python/pull/317

Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.5.3 to 2.6.0.

Release notes, sourced from pydantic's releases (v2.6.0, 2024-01-29; the code released in v2.6.0 is practically identical to that of v2.6.0b1):

- Packaging: check for email-validator version >= 2.0 (#6033); upgrade ruff target version to Python 3.8 (#8341); update to pydantic-extra-types==2.4.1 (#8478); update to pyright==1.1.345 (#8453); update pydantic-core from 2.14.6 to 2.16.1 (significant changes from these updates are described below).
- New features: add NatsDsn (#6874); add ConfigDict.ser_json_inf_nan (#8159); add types.OnErrorOmit (#8222); support AliasGenerator usage (#8282); add Pydantic People page to the docs (#8345); support YYYY-MM-DD datetime parsing (#8404); add bit conversions to the ByteSize class (#8415, #8507); enable JSON schema creation with type ByteSize (#8537); add eval_type_backport to handle the union operator and builtin generic subscripting in older Pythons (#8209); add support for dataclass fields init (#8552); implement pickling for ValidationError (pydantic/pydantic-core#1119); add a unified tuple validator that can handle "variadic" tuples via PEP-646 (pydantic/pydantic-core#865).
- Changes: drop Python 3.7 support (#7188); drop Python 3.7 and PyPy 3.7/3.8 (pydantic/pydantic-core#1129); use positional-only self in the BaseModel constructor, so no field name can ever conflict with it (#8072); make @validate_call return a function instead of a custom descriptor, fixing a binding issue with inheritance and adding the self/cls argument to validation errors (#8268); exclude the BaseModel docstring from the JSON schema description (#8352); introduce a classproperty decorator for model_computed_fields (#8437); explicitly raise an error if field names clash with types (#8243); use a stricter serializer for unions of simple types (pydantic/pydantic-core#1132).
- Performance: ... (truncated)
[PR] Build: Bump mkdocs-material from 9.5.5 to 9.5.6 [iceberg-python]
dependabot[bot] opened a new pull request, #318: URL: https://github.com/apache/iceberg-python/pull/318

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.5 to 9.5.6.

Release notes, sourced from mkdocs-material's releases: mkdocs-material-9.5.6 fixed #6700, missing styles for Mermaid.js labels with Markdown.

Changelog, sourced from mkdocs-material's CHANGELOG:

- 9.5.6+insiders-4.52.0 (2024-01-28): added support for instant previews; fixed footnote tooltip positioning edge cases; fixed #6703, new tags plugin crashes on Windows.
- 9.5.6 (2024-01-28): fixed #6700, missing styles for Mermaid.js labels with Markdown.
- 9.5.5+insiders-4.51.0 (2024-01-24): added support for footnote tooltips.
- 9.5.5 (2024-01-24): updated Tagalog translations; updated Pillow to 10.2 to mitigate security vulnerabilities; improved resilience of instant navigation; fixed #6687 (updated Mermaid.js to version 10.7.0, latest), #6652 (keyboard events in custom elements captured), and #6582, #6565, #6345, #6334, #6275, #6102 (instant-navigation issues: alternate URLs, onclick handlers, browser back button, anchor positioning in Safari, resolution after 404, reload on same-link navigation).
- 9.5.4+insiders-4.50.0 (2024-01-19): added configurable logging capabilities to the privacy plugin.
- 9.5.4 (2024-01-15): fixed #6645 (local storage with invalid value can break site) and #6635 (tags icons before default ignored if default is set).
- 9.5.3+insiders-4.49.2 (2024-01-09): fixed missing attribute lists extension for the tags plugin; fixed #6627, new tags plugin crashes on Python 3.8.
- 9.5.3+insiders-4.49.1 (2024-01-07): improved interop of the new tags plugin with other plugins; fixed #6594 (tags plugin doesn't work with the mkdocs-macros plugin) and #6569 (social plugin crashes if in a different file system location).
- 9.5.3+insiders-4.49.0 (2023-12-29): ... (truncated)

Commits: 6026738 Prepare 9.5.6 release; a65fd1e Updated distribution files; 335dd3a Fixed Markdown formatted nodes in Mermaid.js; f55ce2c Updated Insiders changelog; 6724141 Documentation. See full diff in the compare view: https://github.com/squidfunk/mkdocs-material/compare/9.5.5...9.5.6

Dependabot compatibility score: https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. ... (truncated)
[PR] Build: Bump pytest from 7.4.4 to 8.0.0 [iceberg-python]
dependabot[bot] opened a new pull request, #319: URL: https://github.com/apache/iceberg-python/pull/319

Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.4 to 8.0.0.

Release notes, sourced from pytest's releases:

- pytest 8.0.0 (2024-01-27): see 8.0.0rc1 and 8.0.0rc2 for the full changes since pytest 7.4. Bug fixes: #11842, properly escape the reason of a skip mark when writing JUnit XML files; #11861, avoid "microsecond exceeds 1_000_000" when using log-date-format with the %f specifier, which might cause the test suite to crash.
- pytest 8.0.0rc2 (2024-01-17). Improvements: #11233, improvements to -r for xfailures and xpasses (report tracebacks for xfailures when -rx is set, report captured output for xpasses when -rX is set, and for xpasses add "-" in the summary between test name and reason to match how xfail is displayed); #11825, the pytest_plugin_registered hook has a new plugin_name parameter containing the name by which the plugin is registered. Bug fixes: #11706, fix reporting of teardown errors in higher-scoped fixtures when using --maxfail or --stepwise; #11758, fix an "IndexError: string index out of range" crash in `if highlighted[-1] == "\n" and source[-1] != "\n"` introduced in 8.0.0rc1; #9765 and #11816, fix a frustrating bug whose only error was "assert mod not in mods", caused by str(Path(mod)) and mod.__file__ not necessarily producing the same string while being used interchangeably in some places in the code (this fix also broke the internal API of PytestPluginManager.consider_conftest by introducing a new parameter, mentioned in case external code uses it even though it is marked private).
- pytest 8.0.0rc1 (2023-12-30): rendered changelog at https://docs.pytest.org/en/latest/changelog.html#pytest-8-0-0rc1-2023-12-30. Breaking changes, old deprecations are now errors: #7363, PytestRemovedIn8Warning deprecation warnings are now errors by default; following the plan to remove deprecated features with as little disruption as possible, the affected features will be effectively removed in pytest 8.1, so please consult the deprecations section in the docs for directions on how to update existing code. In the pytest 8.0.X series, it is possible to change the errors back into warnings as a stopgap measure by adding this to your pytest.ini file: [pytest] ... (truncated)

Commits: 478f823 Prepare release version 8.0.0; 6085900 [8.0.x] fix: avoid rounding microsecond to 1_000_000 (#11863); 3b41c65 [8.0.x] Escape skip reason in junitxml (#11845); 747072a [8.0.x] Update docstring of scripts/generate-gh-release-notes.py (#11768); 011a475 Properly attach packages to the GH release notes (#11839, #11840); 97960bd Merge pull request #11835 from pytest-d... (truncated)
[PR] Build: Bump adlfs from 2023.12.0 to 2024.1.0 [iceberg-python]
dependabot[bot] opened a new pull request, #320: URL: https://github.com/apache/iceberg-python/pull/320

Bumps [adlfs](https://github.com/fsspec/adlfs) from 2023.12.0 to 2024.1.0.

Release notes, sourced from adlfs's releases (2024.1.0):

- adlfs: fix version typo by @efiop (fsspec/adlfs#449)
- Check for Hdi_isfolder with a capital by @basnijholt (fsspec/adlfs#418)
- put_file: default to overwrite=True by @pmrowla (fsspec/adlfs#419)
- Fix recursive delete on hierarchical namespace accounts by @Tom-Newton (fsspec/adlfs#454)

New contributors: @Tom-Newton made their first contribution in fsspec/adlfs#454.

Full changelog: https://github.com/fsspec/adlfs/compare/2023.12.0...2024.1.0

Commits: see full diff in the compare view: https://github.com/fsspec/adlfs/compare/2023.12.0...2024.1.0

Dependabot compatibility score: https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options. You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump moto from 4.2.13 to 5.0.0 [iceberg-python]
dependabot[bot] opened a new pull request, #321: URL: https://github.com/apache/iceberg-python/pull/321

Bumps [moto](https://github.com/getmoto/moto) from 4.2.13 to 5.0.0.

Changelog, sourced from moto's CHANGELOG.md:

5.0.0 (Docker digest for 5.0.0: sha256:2faf2460a6446dfe472ac0d799e00341b1c84203d08540c247a6cc09be7c54e9). General:
* All decorators have been replaced with a single decorator: `mock_aws`
* The `mock_batch_simple` and `mock_lambda_simple` decorators can now be configured using the `config` parameter: `@mock_aws(config={"batch": {"use_docker": False}, "lambda": {"use_docker": False}})`
* It is now possible to configure methods/services which should reach out to AWS: `@mock_aws(config={"core": {"mock_credentials": False, "passthrough": {"urls": [], "services": []}}})`
* All requests now return a RequestId

Miscellaneous:
* IAM: the AWS managed policies are no longer loaded by default. If your application depends on these policies, tell Moto explicitly to load them with `@mock_aws(config={"iam": {"load_aws_managed_policies": True}})` or by setting an environment variable: MOTO_IAM_LOAD_MANAGED_POLICIES=true
* S3: list_objects() now returns a hashed ContinuationToken

4.2.14 (Docker digest for 4.2.14: sha256:2fa10aa48e32f85c63c62a7d437b8a4b320a56a8494bc25d45ced370bc159c23). New services: Backup (create_backup_plan(), create_backup_vault(), get_backup_plan(), describe_backup_vault(), delete_backup_plan(), list_backup_plans(), list_backup_vaults(), list_tags(), tag_resource(), untag_resource()). New methods: RDS: ... (truncated)

Commits: 0255717 Prep Release V5; 38f6359 Refactoring: adding typing for event message structure (#7234); 38ee3fa Prep release 5.0.0alpha3; 09c5751 MotoServer: skip using HTTP_HOST header when determining request origin (#7225); 8199a88 Techdebt: update scaffolding to use mock_aws (#7221); dea4a98 Techdebt: remove old/deprecated code (#7220); 16ba3de Techdebt: verify minimum boto3 version, up minimum responses version (#7218); f834d98 IAM: only load AWS managed policies on request (#7219); 0c8ccbb Techdebt: hide private decorator methods/attributes (#7216); 5aa3cc9 S3: list_objects_v2() should have a hashed NextContinuationToken (#7187). Additional commits viewable in the compare view: https://github.com/getmoto/moto/compare/4.2.13...5.0.0

Dependabot compatibility score: https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. Dependabot commands and options. You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR ... (truncated)
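Since the headline change in moto 5.0 is collapsing the per-service decorators into the single `mock_aws`, a test that previously used a service-specific decorator would migrate roughly as sketched below (a minimal example assuming a simple S3 round-trip; the commented-out import shows the moto 4.x form):

```python
import boto3
from moto import mock_aws  # moto 5.x: replaces mock_s3, mock_glue, mock_dynamodb, ...

# moto 4.x equivalent:
#   from moto import mock_s3
#   @mock_s3


@mock_aws
def test_create_bucket() -> None:
    # All AWS calls inside the decorated function hit moto's in-memory backends.
    client = boto3.client("s3", region_name="us-east-1")
    client.create_bucket(Bucket="warehouse")
    names = [bucket["Name"] for bucket in client.list_buckets()["Buckets"]]
    assert names == ["warehouse"]


test_create_bucket()
```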
Re: [PR] Flink: Adds the ability to read from a branch on the Flink Iceberg Source [iceberg]
stevenzwu commented on code in PR #9547: URL: https://github.com/apache/iceberg/pull/9547#discussion_r1470308941 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java: ## @@ -97,7 +98,38 @@ public void clean() { super.clean(); } + // private void insertRows(String partition, String branch, Table table, Row... rows) Review Comment: this should be removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Use execution ID in executor cache [iceberg]
aokolnychyi commented on PR #9583: URL: https://github.com/apache/iceberg/pull/9583#issuecomment-1915715186 I am going to run this a number of times to make sure the new logic is not flaky and then will test it one more time on a cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [DO NOT MERGE] New docs switch [iceberg]
rdblue commented on PR #9520: URL: https://github.com/apache/iceberg/pull/9520#issuecomment-1915715262 @bitsondatadev, thanks! This looks more like what I was expecting! I'm +1 on this when the thread on the dev list concludes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump pypa/cibuildwheel from 2.16.3 to 2.16.4 [iceberg-python]
dependabot[bot] opened a new pull request, #322: URL: https://github.com/apache/iceberg-python/pull/322

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.16.3 to 2.16.4.

Release notes sourced from pypa/cibuildwheel's releases (https://github.com/pypa/cibuildwheel/releases):

v2.16.4
🛠 Update manylinux pins to upgrade from a problematic PyPy version. (#1737)

Changelog sourced from pypa/cibuildwheel's changelog (https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md):

v2.16.4 (28 January 2024)
🛠 Update manylinux pins to upgrade from a problematic PyPy version. (#1737)

Commits:
* 0b04ab1 Bump version: v2.16.4 (https://github.com/pypa/cibuildwheel/commit/0b04ab1040366101259658b355777e4ff2d16f83)
* 34b049f [Bot] Update dependencies (#1737) (https://github.com/pypa/cibuildwheel/commit/34b049f1389fd9cfef5de1d1781ec67759770eea)
* 4a79413 Remove the Cirrus CI badge from readme (https://github.com/pypa/cibuildwheel/commit/4a79413fe6d6fd88b4317bcf0e80252292fd34f5)

See full diff in the compare view: https://github.com/pypa/cibuildwheel/compare/v2.16.3...v2.16.4

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump aiohttp from 3.9.1 to 3.9.2 [iceberg-python]
dependabot[bot] opened a new pull request, #323: URL: https://github.com/apache/iceberg-python/pull/323

Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.9.1 to 3.9.2.

Release notes sourced from aiohttp's releases (https://github.com/aio-libs/aiohttp/releases):

3.9.2
Bug fixes
* Fixed server-side websocket connection leak. Related issues and pull requests on GitHub: #7978.
* Fixed web.FileResponse doing blocking I/O in the event loop. Related issues and pull requests on GitHub: #8012.
* Fixed double compress when compression enabled and compressed file exists in server file responses. Related issues and pull requests on GitHub: #8014.
* Added runtime type check for ClientSession timeout parameter. Related issues and pull requests on GitHub: #8021.
* Fixed an unhandled exception in the Python HTTP parser on header lines starting with a colon -- by :user:pajod. Invalid request lines with anything but a dot between the HTTP major and minor version are now rejected. Invalid header field names containing question mark or slash are now rejected. Such requests are incompatible with :rfc:9110#section-5.6.2 and are not known to be of any legitimate use. Related issues and pull requests on GitHub: #8074.
* Improved validation of paths for static resources requests to the server -- by :user:bdraco.
... (truncated)

Changelog sourced from aiohttp's changelog (https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst):

3.9.2 (2024-01-28)
Bug fixes
* Fixed server-side websocket connection leak. Related issues and pull requests on GitHub: :issue:7978.
* Fixed web.FileResponse doing blocking I/O in the event loop. Related issues and pull requests on GitHub: :issue:8012.
* Fixed double compress when compression enabled and compressed file exists in server file responses. Related issues and pull requests on GitHub: :issue:8014.
* Added runtime type check for ClientSession timeout parameter. Related issues and pull requests on GitHub: :issue:8021.
* Fixed an unhandled exception in the Python HTTP parser on header lines starting with a colon -- by :user:pajod. Invalid request lines with anything but a dot between the HTTP major and minor version are now rejected. Invalid header field names containing question mark or slash are now rejected. Such requests are incompatible with :rfc:9110#section-5.6.2 and are not known to be of any legitimate use. Related issues and pull requests on GitHub: :issue:8074.
... (truncated)

Commits:
* 24a6d64 Release v3.9.2 (#8082) (https://github.com/aio-libs/aiohttp/commit/24a6d64966d99182e95f5d3a29541ef2fec397ad)
* 9118a58 [PR #8079/1c335944 backport][3.9] Validate static paths (#8080) (https://github.com/aio-libs/aiohttp/commit/9118a5831e8a65b8c839eb7e4ac983e040ff41df)
* 435ad46 [PR #3955/8960063e backport][3.9] Replace all tmpdir fixtures with tmp_path (... (https://github.com/aio-libs/aiohttp/commit/435ad46e6c26cbf6ed9a38764e9ba8e7441a0e3b)
* d33bc21 Improve validation in HTTP parser (#8074) (#8078) (https://github.com/aio-libs/aiohttp/commit/d33bc21414e283c9e6fe7f6caf69e2ed60d66c82)
* 0d945d1 [PR #7916/822fbc74 backport][3.9] Add more information to contributing page (... (https://github.com/aio-libs/aiohttp/commit/0d945d1be08f2ba8475513216a66411f053c3217)
* 3ec4fa1 [PR #8069/69bbe874 backport][3.9] 📝 Only show changelog draft for non-release... (https://github.com/aio-libs/aiohttp/commit/3ec4fa1f0e0a0dad218c75dbe5ed09e22d5cc284)
* 419d715 [PR #8066/cba34699 backport][3.9] 💅📝 Restructure the changelog for clarity (#... (https://github.com/aio-libs/aiohttp/commit/419d715c42c46daf1a59e0aff61c1f6d10236982)
* a54dab3 [PR #8049/a379e634 backport][3.9] Set cause for ClientPayloadError (https://github.com/aio-libs/aiohttp/commit/a54dab3b36bcf0d815b9244f52ae7bc5da08f387) (https://redirect.github.com/aio-libs/a
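Of the changes above, the runtime type check on ClientSession's timeout parameter is the one most likely to surface in calling code. A minimal, hedged sketch of the form that remains valid after this bump; the URL is illustrative only.

```python
# Hedged sketch: aiohttp 3.9.2 adds a runtime type check on ClientSession's
# `timeout` argument, so passing an explicit ClientTimeout (rather than a bare
# number) is the safe form. The URL is illustrative only.
import asyncio

import aiohttp


async def fetch_status(url: str) -> int:
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            return resp.status


if __name__ == "__main__":
    print(asyncio.run(fetch_status("https://example.com")))
```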
Re: [PR] Flink: Adds the ability to read from a branch on the Flink Iceberg Source [iceberg]
stevenzwu merged PR #9547: URL: https://github.com/apache/iceberg/pull/9547 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470463965 ## tests/catalog/test_base.py: ## @@ -330,6 +334,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None: assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table +@pytest.mark.parametrize( +"schema,expected", +[ +(lazy_fixture("pyarrow_schema_simple_without_ids"), lazy_fixture("iceberg_schema_simple_no_ids")), +(lazy_fixture("iceberg_schema_simple"), lazy_fixture("iceberg_schema_simple")), +(lazy_fixture("iceberg_schema_nested"), lazy_fixture("iceberg_schema_nested")), +(lazy_fixture("pyarrow_schema_nested_without_ids"), lazy_fixture("iceberg_schema_nested_no_ids")), +], +) +def test_convert_schema_if_needed( +schema: Union[Schema, pa.Schema], +expected: Schema, +catalog: InMemoryCatalog, +) -> None: +assert expected == catalog._convert_schema_if_needed(schema) + + +def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_simple_without_ids: pa.Schema) -> None: +table = catalog.create_table( +identifier=TEST_TABLE_IDENTIFIER, +schema=pyarrow_schema_simple_without_ids, +location=TEST_TABLE_LOCATION, +partition_spec=TEST_TABLE_PARTITION_SPEC, +properties=TEST_TABLE_PROPERTIES, +) +assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table + Review Comment: I think it may be better to have another assertion here: ``` assert table.schema() == iceberg_schema_simple ``` maybe also add fixture parametrization to include the `pyarrow_schema_nested` This is not a big deal. So probably we can discuss/modify this in some follow-up PRs. Currently, `test_sql.py` has covered this case: https://github.com/apache/iceberg-python/pull/305/files#diff-fad2e9bc49f619f03bf0631fce12da1f524c786008dbdfc01a35f6caf19a7c01R161 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
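For readers following the review, here is a rough sketch of what the suggested change could look like once folded into the quoted test. The catalog and schema fixtures are the ones from the existing test module; the identifier value shown is illustrative, and this is a sketch of the suggestion rather than the final test.

```python
# Rough sketch of the suggestion above: keep the existing round-trip check and
# add an assertion on the schema the catalog actually stored. The `catalog`,
# `pyarrow_schema_simple_without_ids`, and `iceberg_schema_simple` fixtures come
# from the quoted test module; the identifier value is illustrative.
import pyarrow as pa

from pyiceberg.schema import Schema

TEST_TABLE_IDENTIFIER = ("com", "organization", "department", "my_table")  # illustrative


def test_create_table_pyarrow_schema(
    catalog, pyarrow_schema_simple_without_ids: pa.Schema, iceberg_schema_simple: Schema
) -> None:
    table = catalog.create_table(
        identifier=TEST_TABLE_IDENTIFIER,
        schema=pyarrow_schema_simple_without_ids,
    )
    assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table
    assert table.schema() == iceberg_schema_simple  # the extra assertion proposed above
```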
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470488890 ## tests/catalog/test_base.py: ## @@ -330,6 +334,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None: assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table +@pytest.mark.parametrize( +"schema,expected", +[ +(lazy_fixture("pyarrow_schema_simple_without_ids"), lazy_fixture("iceberg_schema_simple_no_ids")), +(lazy_fixture("iceberg_schema_simple"), lazy_fixture("iceberg_schema_simple")), +(lazy_fixture("iceberg_schema_nested"), lazy_fixture("iceberg_schema_nested")), +(lazy_fixture("pyarrow_schema_nested_without_ids"), lazy_fixture("iceberg_schema_nested_no_ids")), +], +) +def test_convert_schema_if_needed( +schema: Union[Schema, pa.Schema], +expected: Schema, +catalog: InMemoryCatalog, +) -> None: +assert expected == catalog._convert_schema_if_needed(schema) + + +def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_simple_without_ids: pa.Schema) -> None: +table = catalog.create_table( +identifier=TEST_TABLE_IDENTIFIER, +schema=pyarrow_schema_simple_without_ids, +location=TEST_TABLE_LOCATION, +partition_spec=TEST_TABLE_PARTITION_SPEC, +properties=TEST_TABLE_PROPERTIES, +) +assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table + Review Comment: Thank you for the great suggestion @HonahX . I just tried to update the test in test_base.py, but it requires us to use new_table_metadata instead of the existing mechanism of creating a new TableMetadataV1 instance, which creates a new uuid and breaks the cli/test_console.py tests. So I think it might be simpler to fix this once we get https://github.com/apache/iceberg-python/pull/289 merged in -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470492242 ## tests/catalog/test_base.py: ## @@ -330,6 +334,34 @@ def test_create_table(catalog: InMemoryCatalog) -> None: assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table +@pytest.mark.parametrize( +"schema,expected", +[ +(lazy_fixture("pyarrow_schema_simple_without_ids"), lazy_fixture("iceberg_schema_simple_no_ids")), +(lazy_fixture("iceberg_schema_simple"), lazy_fixture("iceberg_schema_simple")), +(lazy_fixture("iceberg_schema_nested"), lazy_fixture("iceberg_schema_nested")), +(lazy_fixture("pyarrow_schema_nested_without_ids"), lazy_fixture("iceberg_schema_nested_no_ids")), +], +) +def test_convert_schema_if_needed( +schema: Union[Schema, pa.Schema], +expected: Schema, +catalog: InMemoryCatalog, +) -> None: +assert expected == catalog._convert_schema_if_needed(schema) + + +def test_create_table_pyarrow_schema(catalog: InMemoryCatalog, pyarrow_schema_simple_without_ids: pa.Schema) -> None: +table = catalog.create_table( +identifier=TEST_TABLE_IDENTIFIER, +schema=pyarrow_schema_simple_without_ids, +location=TEST_TABLE_LOCATION, +partition_spec=TEST_TABLE_PARTITION_SPEC, +properties=TEST_TABLE_PROPERTIES, +) +assert catalog.load_table(TEST_TABLE_IDENTIFIER) == table + Review Comment: Thanks for trying this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470494285 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I tried this out, and I think we might have to leave this logic in for this iteration. The usage of create_table in PyIceberg is slightly different from the Java implementation in that the user can create an IcebergSchema, instead of a Spark DF schema or a PyArrow schema. This means that the user can choose to include identifier_field_ids in the IcebergSchema that they define. I think it might make sense to consolidate the way we track the identifier_field_ids on create_table, update_schema, replace_table and other similar operations in a future PR. I'll create an issue to track this task -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470494285 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I tried this out, and I think we might have to leave this logic in for this iteration. The usage of create_table in PyIceberg is slightly different from the Java implementation in that the user can create an IcebergSchema, instead of a Spark DF schema or a PyArrow schema. This means that the user can choose to include identifier_field_ids in the IcebergSchema that they define. I think it might make sense to consolidate the way we update the identifier_field_ids on create_table, update_schema, replace_table and other similar operations in a future PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on PR #305: URL: https://github.com/apache/iceberg-python/pull/305#issuecomment-1915934234 Looks good to merge from my side. I left my closing comments on all the open discussions. Thank you @Fokko @kevinjqliu and @HonahX for your reviews! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX commented on PR #305: URL: https://github.com/apache/iceberg-python/pull/305#issuecomment-1915934692 Thanks @syun64 for the great work! Thanks @Fokko and @kevinjqliu for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
HonahX merged PR #305: URL: https://github.com/apache/iceberg-python/pull/305 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] `create_table` with a PyArrow Schema [iceberg-python]
syun64 commented on code in PR #305: URL: https://github.com/apache/iceberg-python/pull/305#discussion_r1470494285 ## pyiceberg/schema.py: ## @@ -1221,50 +1221,57 @@ def assign_fresh_schema_ids(schema_or_type: Union[Schema, IcebergType], next_id: class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]): """Traverses the schema and assigns monotonically increasing ids.""" -reserved_ids: Dict[int, int] +old_id_to_new_id: Dict[int, int] Review Comment: I tried this out, and I think we might have to leave this logic in for this iteration. In PyIceberg, a user can create a table using IcebergSchema, instead of a PyArrow schema. This means that the user can choose to include identifier_field_ids in the IcebergSchema that they define, and the current create_table or assign_fresh_ids function calls don’t have the logic to handle this outside of _SetFreshIDs I think it might make sense to consolidate the way we update the identifier_field_ids on create_table, update_schema, replace_table and other similar operations in a future PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
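For context on the case described in this thread, here is a minimal sketch of a user-built Iceberg Schema that already declares identifier fields being passed straight to create_table, which is why the fresh-ID assignment also has to remap identifier_field_ids. The catalog name, table name, and field layout are illustrative.

```python
# Minimal sketch of the case discussed above: create_table called with a
# user-built Iceberg Schema that already declares identifier fields, so fresh-ID
# assignment also has to remap identifier_field_ids. Catalog name, table name,
# and fields are illustrative.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

schema = Schema(
    NestedField(field_id=10, name="id", field_type=LongType(), required=True),
    NestedField(field_id=11, name="name", field_type=StringType(), required=False),
    identifier_field_ids=[10],  # points at the user-chosen id, not the reassigned one
)

catalog = load_catalog("default")
table = catalog.create_table("default.example_table", schema=schema)

# After fresh IDs are assigned, the identifier field should still resolve to "id".
print(table.schema().identifier_field_ids)
```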
Re: [PR] feat: add support for catalogs with glue implementation to start [iceberg-go]
zeroshade commented on PR #51: URL: https://github.com/apache/iceberg-go/pull/51#issuecomment-1915981210 @wolfeidau Can you add check marks to the appropriate spots in the README for the functionality you're adding? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
liurenjie1024 commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470518241 ## crates/iceberg/src/writer/file_writer/parquet_writer.rs: ## Review Comment: I'm thinking moving this to another crate `iceberg-parquet`, the core crate only contains `FileWriter` trait. WDYT? cc @ZENOTME @Xuanwo @Fokko ## crates/iceberg/src/spec/manifest.rs: ## @@ -924,7 +924,7 @@ impl TryFrom for ManifestStatus { } /// Data file carries data file path, partition tuple, metrics, … -#[derive(Debug, PartialEq, Clone, Eq, TypedBuilder)] +#[derive(Debug, PartialEq, Clone, Eq, Builder)] Review Comment: Why this change? `TypedBuilder` checks it in compile time, while `Builder` checks result in runtime. ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Review Comment: We can't make it concrete `Writer` here since we can't expose opendal api to user. You can store `Box` in parquet writer. ## crates/iceberg/src/io.rs: ## @@ -268,6 +268,16 @@ impl OutputFile { .await?) } +/// Delete file. +pub async fn delete(&self) -> Result<()> { Review Comment: This seems unsound to me. If it's deleted, we should consume it, so its argument should be `self` rather `&self` ## crates/iceberg/src/writer/file_writer/parquet_writer.rs: ## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! The module contains the file writer for parquet file format. + +use std::{ +collections::HashMap, +sync::{atomic::AtomicI64, Arc}, +}; + +use crate::{ +io::OutputFile, +spec::{DataFileBuilder, DataFileFormat}, +writer::CurrentFileStatus, +Error, +}; +use arrow_schema::SchemaRef; +use opendal::Writer; +use parquet::file::properties::WriterProperties; +use parquet::{arrow::AsyncArrowWriter, format::FileMetaData}; + +use super::{track_writer::TrackWriter, FileWriter, FileWriterBuilder}; + +/// ParquetWriterBuilder is used to builder a [`ParquetWriter`] +#[derive(Clone)] +pub struct ParquetWriterBuilder { +out_file: OutputFile, +/// `buffer_size` determines the initial size of the intermediate buffer. 
+/// The intermediate buffer will automatically be resized if necessary +init_buffer_size: usize, +props: WriterProperties, +schema: SchemaRef, +} + +impl ParquetWriterBuilder { +/// Create a new `ParquetWriterBuilder` +pub fn new( +out_file: OutputFile, +init_buffer_size: usize, +props: WriterProperties, +schema: SchemaRef, +) -> Self { +Self { +out_file, +init_buffer_size, +props, +schema, +} +} +} + +impl FileWriterBuilder for ParquetWriterBuilder { +type R = ParquetWriter; + +async fn build(self) -> crate::Result { +let written_size = Arc::new(AtomicI64::new(0)); +let inner_writer = TrackWriter::new(self.out_file.writer().await?, written_size.clone()); +let writer = AsyncArrowWriter::try_new( +inner_writer, +self.schema, +self.init_buffer_size, +Some(self.props), +) +.map_err(|err| { +Error::new( +crate::ErrorKind::Unexpected, +"build error from parquet writer", +) +.with_source(err) +})?; +Ok(ParquetWriter { +out_file: self.out_file, +writer, +written_size, +current_row_num: 0, +}) +} +} + +/// `ParquetWriter`` is used to write arrow data into parquet file on storage. +pub struct ParquetWriter { +out_file: OutputFile, +writer: AsyncArrowWriter>, +written_size: Arc, +current_row_num: usize, +} + +impl ParquetWriter { +fn to_data_file_builder( +metadata: FileMetaData, +written_size: usize, +file_path: String, +) -> DataFileBuilder { +let (column_sizes, value_counts, null_value_counts) = { +let m
Re: [PR] Core: HadoopTable needs to skip file cleanup after task failure under some boundary conditions. [iceberg]
BsoBird commented on PR #9546: URL: https://github.com/apache/iceberg/pull/9546#issuecomment-1915987434 @szehon-ho Hi. Can you check this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Rest Catalog: Add RESTful AppendFiles data operation [iceberg]
danielcweeks commented on code in PR #9292: URL: https://github.com/apache/iceberg/pull/9292#discussion_r1470559173 ## core/src/main/java/org/apache/iceberg/MetadataUpdate.java: ## @@ -490,4 +491,21 @@ public void applyTo(ViewMetadata.Builder viewMetadataBuilder) { viewMetadataBuilder.setCurrentVersionId(versionId); } } + + class AppendFilesUpdate implements MetadataUpdate { Review Comment: We might want to rename this to `AppendManifestsUpdate`, since "appending files" suggests `DataFile` objects rather than manifests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Rest Catalog: Add RESTful AppendFiles data operation [iceberg]
danielcweeks commented on code in PR #9292: URL: https://github.com/apache/iceberg/pull/9292#discussion_r1470562788 ## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java: ## @@ -112,6 +111,7 @@ public class RESTSessionCatalog extends BaseViewSessionCatalog private static final Logger LOG = LoggerFactory.getLogger(RESTSessionCatalog.class); private static final String DEFAULT_FILE_IO_IMPL = "org.apache.iceberg.io.ResolvingFileIO"; private static final String REST_METRICS_REPORTING_ENABLED = "rest-metrics-reporting-enabled"; + private static final String REST_DATA_COMMIT_ENABLED = "rest-data-commit-enabled"; Review Comment: A few places we probably need to update the naming. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add UnboundSortOrder [iceberg-rust]
fqaiser94 commented on code in PR #115: URL: https://github.com/apache/iceberg-rust/pull/115#discussion_r1470584482 ## crates/iceberg/src/spec/sort.rs: ## @@ -88,15 +91,106 @@ impl SortOrder { pub fn is_unsorted(&self) -> bool { self.fields.is_empty() } + +/// Converts to an unbound sort order +pub fn to_unbound(&self) -> UnboundSortOrder { +UnboundSortOrder::builder() +.with_order_id(self.order_id) +.with_fields( +self.fields +.iter() +.map(|x| UnboundSortField { +source_id: x.source_id, +transform: x.transform, +direction: x.direction, +null_order: x.null_order, +}) +.collect_vec(), +) +.build() +.unwrap() +} +} + +/// Reference to [`UnboundSortOrder`]. +pub type UnboundSortOrderRef = Arc; + +/// Entry for every column that is to be sorted +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)] +#[serde(rename_all = "kebab-case")] +pub struct UnboundSortField { +/// A source column id from the table’s schema +pub source_id: i32, +/// A transform that is used to produce values to be sorted on from the source column. +pub transform: Transform, +/// A sort direction, that can only be either asc or desc +pub direction: SortDirection, +/// A null order that describes the order of null values when sorted. +pub null_order: NullOrder, +} + +/// Unbound sort order can be later bound to a schema. +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Builder, Default)] +#[serde(rename_all = "kebab-case")] +#[builder(setter(prefix = "with"))] +#[builder(build_fn(skip))] +pub struct UnboundSortOrder { +/// Identifier for the SortOrder, order_id `0` is no sort order. +pub order_id: i64, +/// Details of the sort +#[builder(setter(each(name = "with_sort_field")))] +pub fields: Vec, +} Review Comment: I'm not opposed to the idea of having separate builder methods. Implemented a rough version of that in the latest [commit](https://github.com/apache/iceberg-rust/pull/115/commits/5a6790323e2d0b697019a8b933f599a220d3540a). Take a look and let me know what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add UnboundSortOrder [iceberg-rust]
fqaiser94 commented on PR #115: URL: https://github.com/apache/iceberg-rust/pull/115#issuecomment-1916069215 Found some time to look at this again :D -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] InMemory Catalog Implementation [iceberg-python]
kevinjqliu commented on code in PR #289: URL: https://github.com/apache/iceberg-python/pull/289#discussion_r1470604775 ## pyiceberg/catalog/in_memory.py: ## @@ -0,0 +1,222 @@ +import uuid +from typing import ( +Dict, +List, +Optional, +Set, +Union, +) + +from pyiceberg.catalog import ( +Catalog, +Identifier, +Properties, +PropertiesUpdateSummary, +) +from pyiceberg.exceptions import ( +NamespaceAlreadyExistsError, +NamespaceNotEmptyError, +NoSuchNamespaceError, +NoSuchTableError, +TableAlreadyExistsError, +) +from pyiceberg.io import WAREHOUSE +from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec +from pyiceberg.schema import Schema +from pyiceberg.table import ( +CommitTableRequest, +CommitTableResponse, +Table, +update_table_metadata, +) +from pyiceberg.table.metadata import new_table_metadata +from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder +from pyiceberg.typedef import EMPTY_DICT + +DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse" + + +class InMemoryCatalog(Catalog): +"""An in-memory catalog implementation.""" + +__tables: Dict[Identifier, Table] +__namespaces: Dict[Identifier, Properties] + +def __init__(self, name: str, **properties: str) -> None: +super().__init__(name, **properties) +self.__tables = {} +self.__namespaces = {} +self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION + +def create_table( +self, +identifier: Union[str, Identifier], +schema: Schema, +location: Optional[str] = None, +partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, +sort_order: SortOrder = UNSORTED_SORT_ORDER, +properties: Properties = EMPTY_DICT, +table_uuid: Optional[uuid.UUID] = None, +) -> Table: +identifier = Catalog.identifier_to_tuple(identifier) +namespace = Catalog.namespace_from(identifier) + +if identifier in self.__tables: +raise TableAlreadyExistsError(f"Table already exists: {identifier}") +else: +if namespace not in self.__namespaces: +self.__namespaces[namespace] = {} + +if not location: +location = f'{self._warehouse_location}/{"/".join(identifier)}' + +metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json' Review Comment: yep, the actual writing is done by `_commit_table` below, but the path of the metadata location is determined here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
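A small, hedged usage sketch of the InMemoryCatalog proposed in this PR, showing where the metadata path built in the quoted snippet ends up. The warehouse location and table name are illustrative, and the import path simply follows the new module added by the PR.

```python
# Hedged usage sketch of the InMemoryCatalog proposed in this PR, showing where
# the metadata path from the snippet above ends up. The warehouse location and
# table name are illustrative; the import path follows the new module.
from pyiceberg.catalog.in_memory import InMemoryCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField

catalog = InMemoryCatalog("test", warehouse="file:///tmp/warehouse")
table = catalog.create_table(
    ("default", "taxi"),
    schema=Schema(NestedField(field_id=1, name="id", field_type=LongType(), required=True)),
)

# Per the create_table snippet above, the metadata file is written under
# file:///tmp/warehouse/default/taxi/metadata/metadata.json once _commit_table runs.
print(table.metadata_location)
```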
Re: [PR] Spark 3.5: Fix flaky TestSparkExecutorCache [iceberg]
aokolnychyi commented on PR #9583: URL: https://github.com/apache/iceberg/pull/9583#issuecomment-1916101647 Re-triggering one more time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Fix flaky TestSparkExecutorCache [iceberg]
aokolnychyi closed pull request #9583: Spark 3.5: Fix flaky TestSparkExecutorCache URL: https://github.com/apache/iceberg/pull/9583 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Spark 3.5: Fix flaky TestSparkExecutorCache [iceberg]
aokolnychyi opened a new pull request, #9583: URL: https://github.com/apache/iceberg/pull/9583 This PR adds code to destroy all currently live table broadcasts after initialization to fix #9511. Our tests have the following behavior:

```
Append
Append
Execute a row-level operation
```

The cleanup logic for each of the appends may trigger invalidation in delete, potentially making all requests go to the underlying storage and skip the cache. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] InMemory Catalog Implementation [iceberg-python]
kevinjqliu commented on PR #289: URL: https://github.com/apache/iceberg-python/pull/289#issuecomment-1916109508 > Should we also add this catalog to the tests in tests/integration/test_reads.py? @Fokko I had the same idea! Unfortunately, the way integration tests are configured right now requires table information to persist in the catalog from `provision.py` to the pytests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Cannot write to local filesystem [iceberg-python]
kevinjqliu closed issue #299: Cannot write to local filesystem URL: https://github.com/apache/iceberg-python/issues/299 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470623846 ## crates/iceberg/src/writer/file_writer/parquet_writer.rs: ## Review Comment: The file formats supported by Iceberg are limited in number, unlike the range of catalogs. I think it's best to keep them in the same crate for easier use and management of dependencies. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470624504 ## crates/iceberg/src/io.rs: ## @@ -268,6 +268,16 @@ impl OutputFile { .await?) } +/// Delete file. +pub async fn delete(&self) -> Result<()> { Review Comment: I'm thinking of the bad cases: if `delete` returns an error, how can we retry the operation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] docs: Add community page [iceberg-python]
Fokko merged PR #315: URL: https://github.com/apache/iceberg-python/pull/315 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470626858 ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Ok(self.op.writer(&self.path[self.relative_path_pos..]).await?) Review Comment: It's better to add a `buffer` for this writer to avoid the `EntityTooSmall` error from S3. --- Another thing we could do is add `concurrent` here so we can upload parts concurrently, but we need to design how users control the concurrency. Maybe we can discuss that later. ## crates/iceberg/src/io.rs: ## @@ -268,6 +268,16 @@ impl OutputFile { .await?) } +/// Delete file. +pub async fn delete(&self) -> Result<()> { +// #TODO +// Do we need to check if file exists? Review Comment: It depends on whether we want to consider it an error if the file doesn't exist. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470626858 ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Ok(self.op.writer(&self.path[self.relative_path_pos..]).await?) Review Comment: It's better to add a `buffer` (for example, 8 MiB) for this writer to avoid the `EntityTooSmall` error from S3. --- Another thing we could do is add `concurrent` here so we can upload parts concurrently, but we need to design how users control the concurrency. Maybe we can discuss that later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add parquet writer [iceberg-rust]
Xuanwo commented on code in PR #176: URL: https://github.com/apache/iceberg-rust/pull/176#discussion_r1470626858 ## crates/iceberg/src/io.rs: ## @@ -278,7 +288,7 @@ impl OutputFile { } /// Creates output file for writing. -pub async fn writer(&self) -> Result { +pub async fn writer(&self) -> Result { Ok(self.op.writer(&self.path[self.relative_path_pos..]).await?) Review Comment: It's better to add a `buffer` (for example, 8 MiB) for this writer to avoid the `EntityTooSmall` error from S3. In the future, we can add a config so users can control it. --- Another thing we could do is add `concurrent` here so we can upload parts concurrently, but we need to design how users control the concurrency. Maybe we can discuss that later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump aiohttp from 3.9.1 to 3.9.2 [iceberg-python]
Fokko merged PR #323: URL: https://github.com/apache/iceberg-python/pull/323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump adlfs from 2023.12.0 to 2024.1.0 [iceberg-python]
Fokko merged PR #320: URL: https://github.com/apache/iceberg-python/pull/320 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump pydantic from 2.5.3 to 2.6.0 [iceberg-python]
Fokko merged PR #317: URL: https://github.com/apache/iceberg-python/pull/317 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org