[GitHub] [iceberg] yuangjiang opened a new issue, #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
yuangjiang opened a new issue, #6236: URL: https://github.com/apache/iceberg/issues/6236

### Apache Iceberg version

main (development)

### Query engine

Spark

### Please describe the bug

Iceberg Spark cannot create a table using the JDBC catalog; catalog initialization fails with:

Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
    at com.mysql.jdbc.Util.getInstance(Util.java:408)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2495)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1903)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1242)
    at org.apache.iceberg.jdbc.JdbcCatalog.lambda$initializeCatalogTables$1(JdbcCatalog.java:152)
    at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:58)
    at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
    at org.apache.iceberg.jdbc.JdbcCatalog.initializeCatalogTables(JdbcCatalog.java:135)
    at org.apache.iceberg.jdbc.JdbcCatalog.initialize(JdbcCatalog.java:103)
    ... 87 more

org.apache.iceberg.jdbc.UncheckedSQLException: Cannot initialize JDBC catalog
    at org.apache.iceberg.jdbc.JdbcCatalog.initialize(JdbcCatalog.java:109)
    at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:212)
    at org.apache.iceberg.CatalogUtil.buildIcebergCatalog(CatalogUtil.java:254)
    at org.apache.iceberg.spark.SparkCatalog.buildIcebergCatalog(SparkCatalog.java:125)
    at org.apache.iceberg.spark.SparkCatalog.initialize(SparkCatalog.java:470)
    at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:60)
    at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$catalog$1(CatalogManager.scala:52)
    at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
    at org.apache.spark.sql.connector.catalog.CatalogManager.catalog(CatalogManager.scala:52)
    at org.apache.spark.sql.connector.catalog.LookupCatalog$CatalogAndIdentifier$.unapply(LookupCatalog.scala:123)
    at org.apache.spark.sql.connector.catalog.LookupCatalog$NonSessionCatalogAndIdentifier$.unapply(LookupCatalog.scala:73)
    at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$NonSessionCatalogAndTable$.unapply(ResolveCatalogs.scala:99)
    at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:35)
    at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:33)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:76)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:75)
    at org.apache.spark.s
[GitHub] [iceberg] yuangjiang commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
yuangjiang commented on issue #6236: URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1321628781

My submit command is as follows:

bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.0.0 \
  --conf spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.prod.warehouse=file://tmp/iceberg \
  --conf spark.sql.catalog.prod.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
  --conf spark.sql.catalog.prod.uri=jdbc:mysql://localhost:3306/iceberg \
  --conf spark.sql.catalog.prod.jdbc.verifyServerCertificate=false \
  --conf spark.sql.catalog.prod.jdbc.useSSL=false \
  --conf spark.sql.catalog.prod.jdbc.user=iceberg \
  --conf spark.sql.catalog.prod.jdbc.password=password
[GitHub] [iceberg] hameizi opened a new pull request, #6237: Core: Fix check is delete file and data file overlap
hameizi opened a new pull request, #6237: URL: https://github.com/apache/iceberg/pull/6237

There is no overlap between the delete file and the data file only when both `deleteLower` and `deleteUpper` are less than `dataLower`, or both `deleteLower` and `deleteUpper` are greater than `dataUpper`.
[GitHub] [iceberg] hameizi commented on pull request #6237: Core: Fix check is delete file and data file overlap
hameizi commented on PR #6237: URL: https://github.com/apache/iceberg/pull/6237#issuecomment-1321652265

@rdblue can you take a look?
[GitHub] [iceberg] yuangjiang commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
yuangjiang commented on issue #6236: URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1321699076

CREATE TABLE `iceberg_namespace_properties` (
  `catalog_name` varchar(255) NOT NULL,
  `namespace` varchar(255) NOT NULL,
  `property_key` varchar(255) NOT NULL,
  `property_value` varchar(5500) DEFAULT NULL,
  PRIMARY KEY (`catalog_name`, `namespace`, `property_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

When I modify the column sizes as above, table creation works. Is the NAMESPACE_PROPERTY_KEY length in the default CREATE_NAMESPACE_PROPERTIES_TABLE statement set too large?
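The failure makes sense arithmetically: InnoDB caps a single index key at 3072 bytes, and with utf8mb4 MySQL reserves up to 4 bytes per character, so a composite primary key of three varchar(255) columns only just fits. A quick check (illustrative arithmetic only; the exact column widths in Iceberg's default DDL may differ):

```python
# Worst-case index-key size for the composite primary key above, assuming
# utf8mb4 (up to 4 bytes per character). Widths are taken from the DDL in
# this comment, not necessarily from Iceberg's default CREATE TABLE.
BYTES_PER_CHAR = 4
MAX_KEY_BYTES = 3072  # InnoDB limit quoted in the error message

key_widths = {"catalog_name": 255, "namespace": 255, "property_key": 255}
key_bytes = sum(width * BYTES_PER_CHAR for width in key_widths.values())

print(key_bytes)                   # 3060
print(key_bytes <= MAX_KEY_BYTES)  # True: 255-char key columns just fit,
                                   # so any wider key column breaks the limit
```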
[GitHub] [iceberg] nastra commented on pull request #6238: Nessie: Make UpdateableReference public
nastra commented on PR #6238: URL: https://github.com/apache/iceberg/pull/6238#issuecomment-1321831628

I think it would actually be better if `NessieIcebergClient` had a re-usable commit operation rather than exposing this class here. I talked to @ajantha-bhat and he'll add it.
[GitHub] [iceberg] nastra closed pull request #6238: Nessie: Make UpdateableReference public
nastra closed pull request #6238: Nessie: Make UpdateableReference public URL: https://github.com/apache/iceberg/pull/6238
[GitHub] [iceberg] Fokko merged pull request #6212: Replace ImmutableMap.Builder.build() with buildOrThrow()
Fokko merged PR #6212: URL: https://github.com/apache/iceberg/pull/6212
[GitHub] [iceberg] Fokko merged pull request #6228: Python: Minor fixes to expression types
Fokko merged PR #6228: URL: https://github.com/apache/iceberg/pull/6228
[GitHub] [iceberg] zhongyujiang commented on pull request #6237: Core: Fix check is delete file and data file overlap
zhongyujiang commented on PR #6237: URL: https://github.com/apache/iceberg/pull/6237#issuecomment-1321994998

I think the current judgment already handles this situation. IIUC, deletes and data will **not overlap** if:

`(dataLower > deleteUpper) || (deleteLower > dataUpper)`

So they **will overlap** when:

`!((dataLower > deleteUpper) || (deleteLower > dataUpper))` -> `(dataLower <= deleteUpper) && (deleteLower <= dataUpper)`

which is exactly the current judgment.
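A minimal sketch of the interval logic being debated, with illustrative names rather than Iceberg's actual code; negating the no-overlap test via De Morgan's law yields exactly the check quoted above:

```python
# Two closed ranges [delete_lower, delete_upper] and [data_lower, data_upper]
# do NOT overlap when one lies entirely before the other.
def no_overlap(delete_lower, delete_upper, data_lower, data_upper) -> bool:
    return data_lower > delete_upper or delete_lower > data_upper

# Negating via De Morgan's law gives the overlap test in the current check.
def overlaps(delete_lower, delete_upper, data_lower, data_upper) -> bool:
    return data_lower <= delete_upper and delete_lower <= data_upper

assert not overlaps(5, 7, 1, 4)  # delete range entirely above the data range
assert overlaps(3, 7, 1, 4)      # the ranges share [3, 4]
assert overlaps(1, 2, 2, 9)      # touching at a single value still overlaps
```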
[GitHub] [iceberg] pvary commented on pull request #6175: Hive: Add UGI to the key in CachedClientPool
pvary commented on PR #6175: URL: https://github.com/apache/iceberg/pull/6175#issuecomment-1322002592

@flyrain: The Catalog is a strange animal. We basically expect it to be short-lived and easy/cheap to drop and recreate. I had a similar discussion [1] on another PR around this. In Hive code we definitely create Catalog objects considering the costs negligible. I feel that we bump into the issue of closing/keeping catalogs more and more, and the current behavior is kind of non-intuitive, but changing it would have a serious impact, so we should consider it carefully. In the meantime I would support the idea of the configurable cache.

[1] https://github.com/apache/iceberg/pull/5166
[GitHub] [iceberg] Fokko commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
Fokko commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1027642501

## python/pyiceberg/table/__init__.py ##
@@ -138,7 +157,10 @@ def __eq__(self, other: Any) -> bool:
         )
 
 
-class TableScan:
+S = TypeVar("S", bound="TableScan", covariant=True)

Review Comment: Instead of introducing a generic, we can also return `Self` which feels more Pythonic to me: https://peps.python.org/pep-0673/

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)

Review Comment: Why the `dataclass` here? Should we make it at least frozen?

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]
+    wrapped: dict[str, Any]
+
+    def __init__(self, partition_type: StructType):
+        self.pos_to_name = {}
+        for pos, field in enumerate(partition_type.fields):
+            self.pos_to_name[pos] = field.name
+
+    def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+        self.wrapped = to_wrap
+        return self
+
+    def get(self, pos: int) -> Any:
+        return self.wrapped[self.pos_to_name[pos]]
+
+    def set(self, pos: int, value: Any) -> None:
+        raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression] = None,
+        partition_filter: Optional[BooleanExpression] = None,
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+    def plan_files(self) -> Iterator[ScanTask]:
+        snapshot = self.snapshot()
+        if not snapshot:
+            return ()
+
+        io = self.table.io
+
+        # step 1: filter manifests using partition summaries
+        # the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id
+
+        @cache
+        def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:

Review Commen
[GitHub] [iceberg] SHuixo commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)
SHuixo commented on issue #6104: URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1322093088

> Yes, we have to wait it to be merged.

Good, looking forward to this new RocksDB feature being merged.

> Had a look about your exception log. The reason is the cdc contains a delete row, but the compaction for such case that contains delete files hasn't been supported.

Does this mean that, for CDC data streaming into Iceberg, there is currently no viable compaction scheme for streams that contain delete operations? To verify whether the **commit exception** in the streaming scenario has a similar problem in the batch scenario, we made the following attempts:

```
1. Enable Flink CDC streaming writes to Iceberg, with a 5-minute checkpoint;
2. While the writer is running, start the compaction program until a **commit exception** occurs;
3. When the above **commit exception** occurs, stop the CDC write program and the compaction program;
4. Start the compaction program again until it is up and running.
```

> The following figure shows the start and end times when Flink CDC writes data to Iceberg:

https://user-images.githubusercontent.com/20868410/203070483-d39c3107-ca61-4d31-bb30-7d63cf821697.PNG

> Some of the logs are as follows:

[compact-data-when-stream-write.log](https://github.com/apache/iceberg/files/10056982/compact-data-when-stream-write.log)
[compact-data-when-write-finish.log](https://github.com/apache/iceberg/files/10056985/compact-data-when-write-finish.log)

Here's a question: is it possible to pause the writer for one round of compaction and, once compaction completes, resume writing from the checkpoint, handling the above commit exception by cyclically pausing, compacting, and writing again?
[GitHub] [iceberg] Fokko commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
Fokko commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028076839

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
+        @cache
+        def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:
+            spec = self.table.specs()[spec_id]
+            return visitors.manifest_evaluator(spec, self.table.schema(), self.partition_filter, self.case_sensitive)
+
+        def partition_summary_filter(manifest_file: ManifestFile) -> bool:

Review Comment: Maybe we should also align on this as a community and stick with filter or list comprehension, where I would prefer the latter :)
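The two styles under discussion, side by side on illustrative data (not the PR's code):

```python
# (path, passes_partition_summary_filter) pairs standing in for manifests
manifests = [("m1.avro", True), ("m2.avro", False)]

kept_filter = list(filter(lambda m: m[1], manifests))  # filter() + predicate
kept_compr = [m for m in manifests if m[1]]            # list comprehension
assert kept_filter == kept_compr == [("m1.avro", True)]
```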
[GitHub] [iceberg] Fokko opened a new pull request, #6239: Docs: Select the right Spark catalog
Fokko opened a new pull request, #6239: URL: https://github.com/apache/iceberg/pull/6239

I copy pasted the example, but still had to select the catalog.
[GitHub] [iceberg] Fokko merged pull request #6034: Python: GlueCatalog Full Implementation
Fokko merged PR #6034: URL: https://github.com/apache/iceberg/pull/6034
[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read
asheeshgarg commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322337039

@nastra I have removed the Spark dependency and just added the raw dependencies:

org.apache.iceberg:iceberg-common:1.0.0
org.apache.iceberg:iceberg-core:1.0.0
org.apache.iceberg:iceberg-data:1.0.0
org.apache.iceberg:iceberg-hive-metastore:1.0.0
org.apache.iceberg:iceberg-parquet:1.0.0
org.apache.iceberg:iceberg-arrow:1.0.0
org.apache.hadoop:hadoop-common:3.3.1

I am getting java.lang.NoClassDefFoundError: org/apache/thrift/TException. Do we have a list of dependencies that we need to set in order to read the data files from Iceberg with the pure Java API, without any processing engine?
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028276873

## python/pyiceberg/table/__init__.py ##
@@ -138,7 +157,10 @@ def __eq__(self, other: Any) -> bool:
         )
 
 
-class TableScan:
+S = TypeVar("S", bound="TableScan", covariant=True)

Review Comment: Nice!
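A minimal sketch of the PEP 673 `Self` alternative referenced above. It requires Python 3.11; on older versions `typing_extensions.Self` is the fallback, though as noted later in this thread it did not behave identically at the time. Illustrative only, not the PR's code:

```python
try:
    from typing import Self  # Python 3.11+
except ImportError:
    from typing_extensions import Self

class TableScan:
    case_sensitive: bool = True

    def with_case_sensitive(self, case_sensitive: bool = True) -> Self:
        # `Self` resolves to the subclass in subclasses, so a DataScan keeps
        # its type through chained calls without a covariant TypeVar.
        self.case_sensitive = case_sensitive
        return self
```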
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028278238

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)

Review Comment: This was to get repr for free. I didn't make it frozen because we reuse it to wrap each partition tuple.
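What `@dataclass(init=False)` buys here, per the comment above: a generated `__repr__` and `__eq__` without a generated `__init__`. A minimal sketch (illustrative):

```python
from dataclasses import dataclass

@dataclass(init=False)  # skip the generated __init__, keep __repr__/__eq__
class Task:
    path: str

    def __init__(self, path: str):  # hand-written to allow custom defaults
        self.path = path

print(Task("a.avro"))  # Task(path='a.avro'), the repr we get for free
```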
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028282275

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file

Review Comment: This is `file` in the JVM version, which is why I went with that. The JVM one is way more complicated than we need it to be, so I'll have to think about how we want to structure the task classes.
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028282699

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
+    def plan_files(self) -> Iterator[ScanTask]:
+        snapshot = self.snapshot()
+        if not snapshot:
+            return ()

Review Comment: I wasn't sure how to return an empty generator, so I just returned a tuple. Happy to update to something better.
[GitHub] [iceberg] nastra commented on issue #6003: Vectorized Read
nastra commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322348315

Can you try `iceberg-hive-runtime` rather than `iceberg-hive-metastore` (you'd also need to remove `iceberg-parquet` as that's shaded inside `iceberg-hive-runtime`)?
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028283685

## python/pyiceberg/manifest.py ##
@@ -141,6 +141,14 @@ def read_manifest_entry(input_file: InputFile) -> Iterator[ManifestEntry]:
         yield ManifestEntry(**dict_repr)
 
 
+def live_entries(input_file: InputFile) -> Iterator[ManifestEntry]:
+    return filter(lambda entry: entry.status != ManifestEntryStatus.DELETED, read_manifest_entry(input_file))

Review Comment: I think we want the planning path to use a generator.
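A generator-function spelling of the `filter()` expression in the diff, using illustrative stand-in types rather than pyiceberg's actual classes. Both forms are lazy; the generator form just reads more explicitly in a planning path:

```python
from enum import Enum
from typing import Iterator, NamedTuple

class Status(Enum):
    ADDED = 1
    EXISTING = 2
    DELETED = 3

class Entry(NamedTuple):
    path: str
    status: Status

def live_entries(entries: Iterator[Entry]) -> Iterator[Entry]:
    # Equivalent to filter(lambda e: e.status != Status.DELETED, entries),
    # written as a generator: entries are produced one at a time on demand.
    for entry in entries:
        if entry.status != Status.DELETED:
            yield entry
```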
[GitHub] [iceberg] rdblue commented on a diff in pull request #6232: Python: Disallow Any generics
rdblue commented on code in PR #6232: URL: https://github.com/apache/iceberg/pull/6232#discussion_r1028308314

## python/pyiceberg/expressions/literals.py ##
@@ -207,7 +207,7 @@ def __init__(self, value: int):
         super().__init__(value, int)
 
     @singledispatchmethod
-    def to(self, type_var: IcebergType) -> Literal:
+    def to(self, type_var: IcebergType) -> Literal:  # type: ignore

Review Comment: Is there no way to return an unknown type from a method in Python? In Java, we would just declare that the method returns `Literal<T>` where `T` comes from context. Then you get a type that may not be correct, but is at least not ``.
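For comparison with the Java `Literal<T>` idiom, Python can spell a generic literal with `typing.Generic`; a minimal sketch, not pyiceberg's actual class hierarchy:

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class Literal(Generic[T]):
    def __init__(self, value: T):
        self.value = value

    def to(self, type_var: type) -> "Literal[object]":
        # Roughly Java's Literal<?>: callers get *a* Literal with an unknown
        # parameter. Tying the parameter to `type_var` is what plain
        # annotations cannot express, hence the `# type: ignore` in the diff.
        raise NotImplementedError
```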
[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read
asheeshgarg commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322426914

@nastra

org.apache.iceberg:iceberg-common:1.0.0
org.apache.iceberg:iceberg-core:1.0.0
org.apache.iceberg:iceberg-data:1.0.0
org.apache.iceberg:iceberg-arrow:1.0.0
org.apache.iceberg:iceberg-hive-runtime:1.0.0
org.apache.hadoop:hadoop-common:3.3.1

Still getting java.lang.NoClassDefFoundError: org/apache/thrift/TException
[GitHub] [iceberg] ajantha-bhat opened a new pull request, #6240: Nessie: Refactor NessieTableOperations#doCommit
ajantha-bhat opened a new pull request, #6240: URL: https://github.com/apache/iceberg/pull/6240

Move the core logic from `NessieTableOperations#doCommit` to `NessieIcebergClient#commitTable`, because the Trino Nessie catalog integration (https://github.com/trinodb/trino/pull/11701) doesn't use Iceberg's `NessieTableOperations` directly. Moving the common logic to `NessieIcebergClient` avoids code duplication.
[GitHub] [iceberg] ajantha-bhat commented on pull request #6240: Nessie: Refactor NessieTableOperations#doCommit
ajantha-bhat commented on PR #6240: URL: https://github.com/apache/iceberg/pull/6240#issuecomment-1322441438

cc: @nastra
[GitHub] [iceberg] nastra commented on issue #6003: Vectorized Read
nastra commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322441996

NoClassDefFoundError generally means that this dependency existed at compilation time but doesn't exist at runtime, so you're missing the right dependency for that. Can you provide a full stack trace and all dependencies that you're using? Or is the above list complete? Or do you have a direct link to this project where this issue happens?
[GitHub] [iceberg] nastra commented on a diff in pull request #6240: Nessie: Refactor NessieTableOperations#doCommit
nastra commented on code in PR #6240: URL: https://github.com/apache/iceberg/pull/6240#discussion_r1028365461

## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java ##
@@ -46,15 +50,7 @@
 import org.projectnessie.error.NessieNamespaceNotFoundException;
 import org.projectnessie.error.NessieNotFoundException;
 import org.projectnessie.error.NessieReferenceNotFoundException;
-import org.projectnessie.model.Branch;
-import org.projectnessie.model.Content;
-import org.projectnessie.model.ContentKey;
-import org.projectnessie.model.EntriesResponse;
-import org.projectnessie.model.GetNamespacesResponse;
-import org.projectnessie.model.IcebergTable;
-import org.projectnessie.model.Operation;
-import org.projectnessie.model.Reference;
-import org.projectnessie.model.Tag;
+import org.projectnessie.model.*;

Review Comment: I don't think we allow * imports in the project
[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read
asheeshgarg commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322444351

@nastra Is there a complete list of dependencies that I can use for a pure Java API program? The above list is complete. I have added:

org.apache.thrift:libthrift:0.16.0

Now getting java.lang.ClassNotFoundException: org.apache.hadoop.hive.metastore.api.UnknownDBException
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6239: Docs: Select the right Spark catalog
ajantha-bhat commented on code in PR #6239: URL: https://github.com/apache/iceberg/pull/6239#discussion_r1028370385

## docs/aws.md ##
@@ -68,6 +68,7 @@ done
 
 # start Spark SQL client shell
 spark-sql --packages $DEPENDENCIES \
+    --conf spark.sql.defaultCatalog=my_catalog \

Review Comment: In the same file there are multiple `spark-sql` commands; do we need to add this to all of them, and also check the other files? :)
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6240: Nessie: Refactor NessieTableOperations#doCommit
ajantha-bhat commented on code in PR #6240: URL: https://github.com/apache/iceberg/pull/6240#discussion_r1028371605

## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java ##
+import org.projectnessie.model.*;

Review Comment: It is a draft PR. My bad. Let me clean it up.
[GitHub] [iceberg] mderoy opened a new issue, #6241: Updating the HiveCatalog Conf with setConf does not also reset FileIO
mderoy opened a new issue, #6241: URL: https://github.com/apache/iceberg/issues/6241

### Apache Iceberg version

0.14.0

### Query engine

_No response_

### Please describe the bug

The Iceberg HiveCatalog class lets you set a configuration via

```
@Override
public void setConf(Configuration conf) {
  this.conf = new Configuration(conf);
}
```

however, doing this does not actually update the things in the class that depend on it, like the FileIO... Our application was working properly until we tried to remove the S3 credential environment variables and switch to passing them programmatically as part of the HiveConf. Unfortunately we did this

```
m_catalog.initialize("hive", properties);
m_catalog.setConf(config);
```

instead of

```
m_catalog.setConf(config);
m_catalog.initialize("hive", properties);
```

which led to the error `No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider`.

Working my way back through the stack I realized that the Conf being passed was blank, which is why the credentials weren't being set up, and reading the source I realized that a HadoopFileIO was created with a blank Conf during initialize(), even though our class had the proper Conf set afterwards. To avoid such issues I'd prefer that setConf be removed and replaced with a constructor that accepts the Configuration (keeping a constructor without it that creates a new blank Conf)... or maybe have setConf throw an error if the Conf is already set... or maybe even have setConf call initialize() so these things use the proper Conf.
[GitHub] [iceberg] mderoy commented on issue #6241: Updating the HiveCatalog Conf with setConf does not also reset FileIO
mderoy commented on issue #6241: URL: https://github.com/apache/iceberg/issues/6241#issuecomment-1322538070

I understand this is mostly "user error", but I needed to read the source to get to the root of the problem, which was not obvious. I'd like it if the interface could enforce this... but in the worst case, even if this gets closed, the error message is attached to the ticket and is now searchable for anyone else who may struggle with this for a few days.
[GitHub] [iceberg] Fokko opened a new pull request, #6242: API: Restore the type of the identity transform
Fokko opened a new pull request, #6242: URL: https://github.com/apache/iceberg/pull/6242

This caused some regression for the Iceberg 1.1.0 release:

```
2022-11-21T12:05:46.6549795Z [ERROR] io.trino.plugin.iceberg.TestIcebergSystemTables.testManifestsTable  Time elapsed: 0.701 s  <<< FAILURE!
2022-11-21T12:05:46.6550853Z java.lang.AssertionError:
2022-11-21T12:05:46.6551986Z [Rows for query [SELECT added_data_files_count, existing_rows_count, added_rows_count, deleted_data_files_count, deleted_rows_count, partitions FROM test_schema."test_table$manifests"]]
2022-11-21T12:05:46.6553075Z Expecting:
2022-11-21T12:05:46.6553593Z   <(2, 0, 3, 0, 0, [[false, false, 18148, 18149]]), (2, 0, 3, 0, 0, [[false, false, 18147, 18148]])>
2022-11-21T12:05:46.6553980Z to contain exactly in any order:
2022-11-21T12:05:46.6554557Z   <[(2, 0, 3, 0, 0, [[false, false, 2019-09-08, 2019-09-09]]),
2022-11-21T12:05:46.6554992Z     (2, 0, 3, 0, 0, [[false, false, 2019-09-09, 2019-09-10]])]>
2022-11-21T12:05:46.6555273Z elements not found:
2022-11-21T12:05:46.6555804Z   <(2, 0, 3, 0, 0, [[false, false, 2019-09-08, 2019-09-09]]), (2, 0, 3, 0, 0, [[false, false, 2019-09-09, 2019-09-10]])>
2022-11-21T12:05:46.6556132Z and elements not expected:
2022-11-21T12:05:46.6556488Z   <(2, 0, 3, 0, 0, [[false, false, 18148, 18149]]), (2, 0, 3, 0, 0, [[false, false, 18147, 18148]])>
```

The system tables (manifests in this example above) would return the days since epoch instead of a date.
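The mismatched values in the failure are consistent with dates being surfaced as their internal encoding, days since the Unix epoch. A quick check of the numbers in the log (illustrative):

```python
# 18147, 18148, and 18149 are the epoch-day encodings of the expected dates,
# which is what "days since epoch instead of a date" means here.
from datetime import date, timedelta

epoch = date(1970, 1, 1)
assert epoch + timedelta(days=18147) == date(2019, 9, 8)
assert epoch + timedelta(days=18148) == date(2019, 9, 9)
assert epoch + timedelta(days=18149) == date(2019, 9, 10)
```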
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028437459

## python/pyiceberg/table/__init__.py ##
@@ -16,24 +16,43 @@
 # under the License.
 from __future__ import annotations
 
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from functools import cache

Review Comment: Done.
[GitHub] [iceberg] rdblue commented on pull request #6242: API: Restore the type of the identity transform
rdblue commented on PR #6242: URL: https://github.com/apache/iceberg/pull/6242#issuecomment-1322541809

Good catch. Thanks, @Fokko!
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028440440

## python/pyiceberg/table/__init__.py ##
@@ -138,7 +157,10 @@ def __eq__(self, other: Any) -> bool:
         )
 
 
-class TableScan:
+S = TypeVar("S", bound="TableScan", covariant=True)

Review Comment: I tried this, but the one in typing_extensions didn't work like the one in 3.11, so I rolled it back to this type variable. Maybe we can figure it out later.
[GitHub] [iceberg] Fokko opened a new issue, #6243: Python: BoundType and BoundPredicate should match type
Fokko opened a new issue, #6243: URL: https://github.com/apache/iceberg/issues/6243

### Feature Request / Improvement

When we bind an unbound expression, we'll convert the literal values to the type of field that it is being bound to. If we instantiate a BoundPredicate, the type should be correct, and we should throw an error otherwise.

### Query engine

_No response_
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028472043

## python/pyiceberg/table/__init__.py ##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
+        @cache
+        def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:
+            spec = self.table.specs()[spec_id]
+            return visitors.manifest_evaluator(spec, self.table.schema(), self.partition_filter, self.case_sensitive)
+
+        def partition_summary_filter(manifest_file: ManifestFile) -> bool:

Review Comment: Updated to the comprehension.
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028473168 ## python/pyiceberg/table/__init__.py: ## @@ -199,16 +223,143 @@ def use_ref(self, name: str): raise ValueError(f"Cannot scan unknown ref={name}") -def select(self, *field_names: str) -> TableScan: +def select(self, *field_names: str) -> S: if "*" in self.selected_fields: return self.update(selected_fields=field_names) return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names -def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan: +def filter_rows(self, new_row_filter: BooleanExpression) -> S: return self.update(row_filter=And(self.row_filter, new_row_filter)) -def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan: +def filter_partitions(self, new_partition_filter: BooleanExpression) -> S: return self.update(partition_filter=And(self.partition_filter, new_partition_filter)) -def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan: +def with_case_sensitive(self, case_sensitive: bool = True) -> S: return self.update(case_sensitive=case_sensitive) + + +class ScanTask(ABC): +pass + + +@dataclass(init=False) +class FileScanTask(ScanTask): +file: DataFile +start: int +length: int + +def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None): +self.file = data_file +self.start = start or 0 +self.length = length or data_file.file_size_in_bytes + + +class _DictAsStruct(StructProtocol): +pos_to_name: dict[int, str] +wrapped: dict[str, Any] + +def __init__(self, partition_type: StructType): +self.pos_to_name = {} +for pos, field in enumerate(partition_type.fields): +self.pos_to_name[pos] = field.name + +def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct: +self.wrapped = to_wrap +return self + +def get(self, pos: int) -> Any: +return self.wrapped[self.pos_to_name[pos]] + +def set(self, pos: int, value: Any) -> None: +raise NotImplementedError("Cannot set values in DictAsStruct") + + +class DataScan(TableScan["DataScan"]): +def __init__( +self, +table: Table, +row_filter: Optional[BooleanExpression] = None, +partition_filter: Optional[BooleanExpression] = None, +selected_fields: Tuple[str] = ("*",), +case_sensitive: bool = True, +snapshot_id: Optional[int] = None, +options: Properties = EMPTY_DICT, +): +super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options) + +def plan_files(self) -> Iterator[ScanTask]: +snapshot = self.snapshot() +if not snapshot: +return () Review Comment: Looks like I can just `return` here and it works. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
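For context, the bare `return` works because of generator semantics, assuming `plan_files` is a generator, which the `Iterator[ScanTask]` return type suggests. A standalone sketch:

```python
# In a generator, a bare `return` simply ends iteration, so the caller sees
# an empty iterator -- the same observable effect the earlier `return ()`
# was aiming for.
from typing import Iterator


def plan_files(has_snapshot: bool) -> Iterator[str]:
    if not has_snapshot:
        return  # ends the generator early; nothing is yielded
    yield "task-1"
    yield "task-2"


assert list(plan_files(has_snapshot=False)) == []
assert list(plan_files(has_snapshot=True)) == ["task-1", "task-2"]
```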
[GitHub] [iceberg] Fokko commented on a diff in pull request #6232: Python: Disallow Any generics
Fokko commented on code in PR #6232: URL: https://github.com/apache/iceberg/pull/6232#discussion_r1028473547 ## python/pyiceberg/expressions/literals.py: ## @@ -207,7 +207,7 @@ def __init__(self, value: int): super().__init__(value, int) @singledispatchmethod -def to(self, type_var: IcebergType) -> Literal: +def to(self, type_var: IcebergType) -> Literal: # type: ignore Review Comment: I don't like this either; the problem is that it is unbound. I have some ideas on how to fix this properly in https://github.com/apache/iceberg/issues/6231 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028476399 ## python/pyiceberg/table/__init__.py: ## @@ -16,24 +16,43 @@ # under the License. from __future__ import annotations +from abc import ABC, abstractmethod +from dataclasses import dataclass +from functools import cache Review Comment: Moving these methods to the instance and using `lru_cache` caused lint failures: ``` python/pyiceberg/table/__init__.py:290:5: W1518: 'lru_cache(maxsize=None)' or 'cache' will keep all method args alive indefinitely, including 'self' (method-cache-max-size-none) ``` This was fixed if I use `lru_cache` but don't move the methods to instance methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
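A small sketch of the warning being discussed, with made-up names: caching an instance method with `lru_cache`/`cache` memoizes on `(self, args)`, so every instance ever passed through the cache is kept alive with it.

```python
# Illustrative sketch of pylint's method-cache-max-size-none warning above;
# Scan and the function names are invented.
from functools import lru_cache


class Scan:
    @lru_cache(maxsize=None)  # cache keys include `self`, pinning the instance
    def manifest_filter(self, spec_id: int) -> str:
        return f"filter-for-spec-{spec_id}"


# One shape that avoids the warning: cache a function that is not an instance
# method, so no `self` lands in the cache keys.
@lru_cache(maxsize=None)
def build_manifest_filter(spec_id: int) -> str:
    return f"filter-for-spec-{spec_id}"
```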
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028477122 ## python/pyiceberg/table/__init__.py: ## @@ -199,16 +223,143 @@ def use_ref(self, name: str): raise ValueError(f"Cannot scan unknown ref={name}") -def select(self, *field_names: str) -> TableScan: +def select(self, *field_names: str) -> S: if "*" in self.selected_fields: return self.update(selected_fields=field_names) return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names -def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan: +def filter_rows(self, new_row_filter: BooleanExpression) -> S: return self.update(row_filter=And(self.row_filter, new_row_filter)) -def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan: +def filter_partitions(self, new_partition_filter: BooleanExpression) -> S: return self.update(partition_filter=And(self.partition_filter, new_partition_filter)) -def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan: +def with_case_sensitive(self, case_sensitive: bool = True) -> S: return self.update(case_sensitive=case_sensitive) + + +class ScanTask(ABC): +pass + + +@dataclass(init=False) +class FileScanTask(ScanTask): +file: DataFile +start: int +length: int + +def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None): +self.file = data_file +self.start = start or 0 +self.length = length or data_file.file_size_in_bytes + + +class _DictAsStruct(StructProtocol): +pos_to_name: dict[int, str] +wrapped: dict[str, Any] + +def __init__(self, partition_type: StructType): +self.pos_to_name = {} Review Comment: Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028477438 ## python/pyiceberg/table/__init__.py: ## @@ -199,16 +223,143 @@ def use_ref(self, name: str): raise ValueError(f"Cannot scan unknown ref={name}") -def select(self, *field_names: str) -> TableScan: +def select(self, *field_names: str) -> S: if "*" in self.selected_fields: return self.update(selected_fields=field_names) return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names -def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan: +def filter_rows(self, new_row_filter: BooleanExpression) -> S: return self.update(row_filter=And(self.row_filter, new_row_filter)) -def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan: +def filter_partitions(self, new_partition_filter: BooleanExpression) -> S: return self.update(partition_filter=And(self.partition_filter, new_partition_filter)) -def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan: +def with_case_sensitive(self, case_sensitive: bool = True) -> S: return self.update(case_sensitive=case_sensitive) + + +class ScanTask(ABC): +pass + + +@dataclass(init=False) +class FileScanTask(ScanTask): +file: DataFile +start: int +length: int + +def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None): +self.file = data_file +self.start = start or 0 +self.length = length or data_file.file_size_in_bytes + + +class _DictAsStruct(StructProtocol): +pos_to_name: dict[int, str] Review Comment: Done. ## python/pyiceberg/table/__init__.py: ## @@ -199,16 +223,143 @@ def use_ref(self, name: str): raise ValueError(f"Cannot scan unknown ref={name}") -def select(self, *field_names: str) -> TableScan: +def select(self, *field_names: str) -> S: if "*" in self.selected_fields: return self.update(selected_fields=field_names) return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names -def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan: +def filter_rows(self, new_row_filter: BooleanExpression) -> S: return self.update(row_filter=And(self.row_filter, new_row_filter)) -def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan: +def filter_partitions(self, new_partition_filter: BooleanExpression) -> S: return self.update(partition_filter=And(self.partition_filter, new_partition_filter)) -def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan: +def with_case_sensitive(self, case_sensitive: bool = True) -> S: return self.update(case_sensitive=case_sensitive) + + +class ScanTask(ABC): +pass + + +@dataclass(init=False) +class FileScanTask(ScanTask): +file: DataFile +start: int +length: int + +def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None): +self.file = data_file +self.start = start or 0 +self.length = length or data_file.file_size_in_bytes + + +class _DictAsStruct(StructProtocol): +pos_to_name: dict[int, str] +wrapped: dict[str, Any] Review Comment: Done. 
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028478168 ## python/pyproject.toml: ## @@ -84,6 +84,7 @@ build-backend = "poetry.core.masonry.api" [tool.poetry.extras] pyarrow = ["pyarrow"] +duckdb = ["duckdb"] Review Comment: Added the import. Will update docs, too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko merged pull request #6232: Python: Disallow Any generics
Fokko merged PR #6232: URL: https://github.com/apache/iceberg/pull/6232 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028489268 ## python/pyiceberg/table/__init__.py: ## @@ -199,16 +223,143 @@ def use_ref(self, name: str): raise ValueError(f"Cannot scan unknown ref={name}") -def select(self, *field_names: str) -> TableScan: +def select(self, *field_names: str) -> S: if "*" in self.selected_fields: return self.update(selected_fields=field_names) return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names -def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan: +def filter_rows(self, new_row_filter: BooleanExpression) -> S: return self.update(row_filter=And(self.row_filter, new_row_filter)) -def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan: +def filter_partitions(self, new_partition_filter: BooleanExpression) -> S: return self.update(partition_filter=And(self.partition_filter, new_partition_filter)) -def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan: +def with_case_sensitive(self, case_sensitive: bool = True) -> S: return self.update(case_sensitive=case_sensitive) + + +class ScanTask(ABC): +pass + + +@dataclass(init=False) +class FileScanTask(ScanTask): +file: DataFile +start: int +length: int + +def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None): +self.file = data_file +self.start = start or 0 +self.length = length or data_file.file_size_in_bytes + + +class _DictAsStruct(StructProtocol): +pos_to_name: dict[int, str] +wrapped: dict[str, Any] + +def __init__(self, partition_type: StructType): +self.pos_to_name = {} +for pos, field in enumerate(partition_type.fields): +self.pos_to_name[pos] = field.name + +def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct: +self.wrapped = to_wrap +return self + +def get(self, pos: int) -> Any: +return self.wrapped[self.pos_to_name[pos]] + +def set(self, pos: int, value: Any) -> None: +raise NotImplementedError("Cannot set values in DictAsStruct") + + +class DataScan(TableScan["DataScan"]): +def __init__( +self, +table: Table, +row_filter: Optional[BooleanExpression] = None, +partition_filter: Optional[BooleanExpression] = None, +selected_fields: Tuple[str] = ("*",), +case_sensitive: bool = True, +snapshot_id: Optional[int] = None, +options: Properties = EMPTY_DICT, +): +super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options) + +def plan_files(self) -> Iterator[ScanTask]: +snapshot = self.snapshot() +if not snapshot: +return () + +io = self.table.io + +# step 1: filter manifests using partition summaries +# the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id + +@cache +def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]: Review Comment: This conflicted with the use of `@lru_cache`, so I ended up removing the cache and making these instance helper methods. To cache the values, I added a [`KeyDefaultDict`](https://stackoverflow.com/questions/2912231/is-there-a-clever-way-to-pass-the-key-to-defaultdicts-default-factory) that initializes values using these helpers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
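A minimal sketch of the key-aware default dict referenced above (the linked Stack Overflow pattern); this is the idea, not necessarily pyiceberg's exact class:

```python
# Unlike collections.defaultdict, __missing__ here passes the missing key to
# the factory, so each spec id can lazily build and cache its own evaluator.
from typing import Callable, Dict, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class KeyDefaultDict(Dict[K, V]):
    def __init__(self, default_factory: Callable[[K], V]):
        super().__init__()
        self.default_factory = default_factory

    def __missing__(self, key: K) -> V:
        # Called by dict lookup on a miss: build, cache, and return the value.
        value = self.default_factory(key)
        self[key] = value
        return value


evaluators = KeyDefaultDict(lambda spec_id: f"evaluator-for-spec-{spec_id}")
assert evaluators[3] == "evaluator-for-spec-3"
assert 3 in evaluators  # cached after first access
```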
[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028501655 ## python/pyiceberg/table/__init__.py: ## @@ -16,24 +16,43 @@ # under the License. from __future__ import annotations +from abc import ABC, abstractmethod +from dataclasses import dataclass +from functools import cache Review Comment: I removed this in favor of a dict that creates default values based on key. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] ahshahid commented on issue #6198: colStats flag in TableContext remains false except in situation where delete files are present
ahshahid commented on issue #6198: URL: https://github.com/apache/iceberg/issues/6198#issuecomment-1322664170 I am trying to reproduce the behaviour that I saw. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read
asheeshgarg commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322683140 After a couple of jar iterations I am able to read the catalog; now, when reading the data, I am getting:
```
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1161)
at org.apache.arrow.vector.BaseVariableWidthVector.initOffsetBuffer(BaseVariableWidthVector.java:242)
at org.apache.arrow.vector.BaseVariableWidthVector.allocateBytes(BaseVariableWidthVector.java:467)
at org.apache.arrow.vector.BaseVariableWidthVector.allocateNew(BaseVariableWidthVector.java:419)
at org.apache.arrow.vector.BaseVariableWidthVector.allocateNewSafe(BaseVariableWidthVector.java:393)
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnOriginalType(VectorizedArrowReader.java:247)
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:218)
at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:132)
at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:46)
at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:29)
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:149)
at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:313)
at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:189)
at com.jpmorgan.Test.readContent(Test.java:62)
at com.jpmorgan.Test.main(Test.java:72)
Caused by: java.lang.RuntimeException: Failed to initialize MemoryUtil.
at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:136)
... 15 more
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field long java.nio.Buffer.address accessible: module java.base does not "opens java.nio" to unnamed module @3fb875d8
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:84)
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] JonasJ-ap commented on pull request #5331: WIP: Adding support for Delta to Iceberg migration
JonasJ-ap commented on PR #5331: URL: https://github.com/apache/iceberg/pull/5331#issuecomment-1322684304 Hi @ericlgoodman. My name is Rushan Jiang, a CS undergrad at CMU. I am interested in learning about and contributing to this migration support. I noticed this PR has not been updated for some time. Would you mind allowing me to continue your work? I appreciate your time and consideration. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #6233: Python: Implement DataScan.plan_files
Fokko commented on code in PR #6233: URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028552098 ## python/pyiceberg/table/__init__.py: ## @@ -199,16 +223,144 @@ def use_ref(self, name: str): raise ValueError(f"Cannot scan unknown ref={name}") -def select(self, *field_names: str) -> TableScan: +def select(self, *field_names: str) -> S: if "*" in self.selected_fields: return self.update(selected_fields=field_names) return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names -def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan: +def filter_rows(self, new_row_filter: BooleanExpression) -> S: return self.update(row_filter=And(self.row_filter, new_row_filter)) -def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan: +def filter_partitions(self, new_partition_filter: BooleanExpression) -> S: return self.update(partition_filter=And(self.partition_filter, new_partition_filter)) -def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan: +def with_case_sensitive(self, case_sensitive: bool = True) -> S: return self.update(case_sensitive=case_sensitive) + + +class ScanTask(ABC): +pass + + +@dataclass(init=False) +class FileScanTask(ScanTask): +file: DataFile +start: int +length: int + +def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None): +self.file = data_file +self.start = start or 0 +self.length = length or data_file.file_size_in_bytes + + +class _DictAsStruct(StructProtocol): +pos_to_name: Dict[int, str] +wrapped: Dict[str, Any] + +def __init__(self, partition_type: StructType): +self.pos_to_name = {pos: field.name for pos, field in enumerate(partition_type.fields)} + +def wrap(self, to_wrap: Dict[str, Any]) -> _DictAsStruct: +self.wrapped = to_wrap +return self + +def get(self, pos: int) -> Any: +return self.wrapped[self.pos_to_name[pos]] + +def set(self, pos: int, value: Any) -> None: +raise NotImplementedError("Cannot set values in DictAsStruct") + + +class DataScan(TableScan["DataScan"]): +def __init__( +self, +table: Table, +row_filter: Optional[BooleanExpression] = None, +partition_filter: Optional[BooleanExpression] = None, +selected_fields: Tuple[str] = ("*",), +case_sensitive: bool = True, +snapshot_id: Optional[int] = None, +options: Properties = EMPTY_DICT, +): +super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options) + +def _build_manifest_evaluator(self, spec_id: int) -> Callable[[ManifestFile], bool]: +spec = self.table.specs()[spec_id] +return visitors.manifest_evaluator(spec, self.table.schema(), self.partition_filter, self.case_sensitive) + +def _build_partition_evaluator(self, spec_id: int) -> Callable[[DataFile], bool]: +spec = self.table.specs()[spec_id] +partition_type = spec.partition_type(self.table.schema()) +partition_schema = Schema(*partition_type.fields) + +# TODO: project the row filter # pylint: disable=W0511 +partition_expr = And(self.partition_filter, AlwaysTrue()) + +# TODO: remove the dict to struct wrapper by using a StructProtocol record # pylint: disable=W0511 +wrapper = _DictAsStruct(partition_type) +evaluator = visitors.expression_evaluator(partition_schema, partition_expr, self.case_sensitive) + +return lambda data_file: evaluator(wrapper.wrap(data_file.partition)) + +def plan_files(self) -> Iterator[ScanTask]: +snapshot = self.snapshot() +if not snapshot: +return + +io = self.table.io + +# step 1: filter manifests using partition summaries +# the filter 
depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id + +manifest_evaluators: Dict[int, Callable[[ManifestFile], bool]] = KeyDefaultDict(self._build_manifest_evaluator) + +manifests = [ +manifest_file +for manifest_file in snapshot.manifests(io) +if manifest_evaluators[manifest_file.partition_spec_id](manifest_file) +] + +# step 2: filter the data files in each manifest +# this filter depends on the partition spec used to write the manifest file + +partition_evaluators: Dict[int, Callable[[DataFile], bool]] = KeyDefaultDict(self._build_partition_evaluator) Review Comment: I like this, very elegant! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #6233: Python: Implement DataScan.plan_files
rdblue merged PR #6233: URL: https://github.com/apache/iceberg/pull/6233 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #6233: Python: Implement DataScan.plan_files
rdblue commented on PR #6233: URL: https://github.com/apache/iceberg/pull/6233#issuecomment-1322733138 Thanks, @Fokko! I'll follow up with testing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection
rdblue commented on code in PR #6128: URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028599283 ## python/pyiceberg/expressions/__init__.py: ## @@ -601,36 +640,65 @@ def __eq__(self, other): def __repr__(self) -> str: return f"{str(self.__class__.__name__)}(term={repr(self.term)}, literal={repr(self.literal)})" +@property +@abstractmethod +def as_unbound(self) -> Type[LiteralPredicate[L]]: +... + class BoundEqualTo(BoundLiteralPredicate[L]): def __invert__(self) -> BoundNotEqualTo[L]: return BoundNotEqualTo[L](self.term, self.literal) +@property +def as_unbound(self) -> Type[EqualTo[L]]: +return EqualTo + class BoundNotEqualTo(BoundLiteralPredicate[L]): def __invert__(self) -> BoundEqualTo[L]: return BoundEqualTo[L](self.term, self.literal) +@property +def as_unbound(self) -> Type[NotEqualTo[L]]: +return NotEqualTo Review Comment: Does this need the type param, `NotEqualTo[L]`? It looks like `GreaterThan` has it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
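A stripped-down sketch of the `as_unbound` pattern under review (simplified classes, no bound terms or references): each bound predicate exposes the class of its unbound counterpart, so generic code can rebuild predicates without long `isinstance` chains.

```python
# Simplified sketch; the real pyiceberg predicates carry bound terms and
# references, elided here to show only the as_unbound shape.
from typing import Generic, Type, TypeVar

L = TypeVar("L")


class NotEqualTo(Generic[L]):
    def __init__(self, term: str, literal: L):
        self.term, self.literal = term, literal


class BoundNotEqualTo(Generic[L]):
    def __init__(self, term: str, literal: L):
        self.term, self.literal = term, literal

    @property
    def as_unbound(self) -> Type[NotEqualTo[L]]:
        # Parameterizing the return type, as the review comment suggests,
        # keeps the literal's type flowing through to the rebuilt predicate.
        return NotEqualTo


bound = BoundNotEqualTo("id", 5)
unbound = bound.as_unbound("partition_id", bound.literal)
assert isinstance(unbound, NotEqualTo)
```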
[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection
rdblue commented on code in PR #6128: URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028602287 ## python/pyiceberg/transforms.py: ## @@ -644,8 +748,88 @@ def can_transform(self, _: IcebergType) -> bool: def result_type(self, source: IcebergType) -> IcebergType: return source +def project(self, name: str, pred: BoundPredicate[L]) -> Optional[UnboundPredicate[Any]]: +return None + def to_human_string(self, _: IcebergType, value: Optional[S]) -> str: return "null" def __repr__(self) -> str: return "VoidTransform()" + + +def _truncate_number( +name: str, pred: BoundLiteralPredicate[L], transform: Callable[[Optional[L]], Optional[L]] +) -> Optional[UnboundPredicate[Any]]: +boundary = pred.literal + +if not isinstance(boundary, (LongLiteral, DecimalLiteral, DateLiteral, TimestampLiteral)): +raise ValueError(f"Expected a numeric literal, got: {type(boundary)}") + +if isinstance(pred, BoundLessThan): +return LessThanOrEqual(Reference(name), _transform_literal(transform, boundary.decrement())) # type: ignore +elif isinstance(pred, BoundLessThanOrEqual): +return LessThanOrEqual(Reference(name), _transform_literal(transform, boundary)) +elif isinstance(pred, BoundGreaterThan): +return GreaterThanOrEqual(Reference(name), _transform_literal(transform, boundary.increment())) # type: ignore +elif isinstance(pred, BoundGreaterThanOrEqual): +return GreaterThanOrEqual(Reference(name), _transform_literal(transform, boundary)) +elif isinstance(pred, BoundEqualTo): +return EqualTo(Reference(name), _transform_literal(transform, boundary)) +else: +return None + + +def _truncate_array( +name: str, pred: BoundLiteralPredicate[L], transform: Callable[[Optional[L]], Optional[L]] +) -> Optional[UnboundPredicate[Any]]: +boundary = pred.literal + +if type(pred) in {BoundLessThan, BoundLessThanOrEqual}: +return LessThanOrEqual(Reference(name), _transform_literal(transform, boundary)) +elif type(pred) in {BoundGreaterThan, BoundGreaterThanOrEqual}: +return GreaterThanOrEqual(Reference(name), _transform_literal(transform, boundary)) +if isinstance(pred, BoundEqualTo): +return EqualTo(Reference(name), _transform_literal(transform, boundary)) +else: +return None + + +def _project_transform_predicate( +transform: Transform[Any, Any], partition_name: str, pred: BoundPredicate[L] +) -> Optional[UnboundPredicate[Any]]: +term = pred.term +if isinstance(term, BoundTransform) and transform == term.transform: +return _remove_transform(partition_name, pred) +return None + + +def _remove_transform(partition_name: str, pred: BoundPredicate[L]): +if isinstance(pred, BoundUnaryPredicate): +return pred.as_unbound(Reference(partition_name)) +elif isinstance(pred, BoundLiteralPredicate): +return pred.as_unbound(Reference(partition_name), pred.literal) +elif isinstance(pred, (BoundIn, BoundNotIn)): +return pred.as_unbound(Reference(partition_name), pred.literals) +else: +raise ValueError(f"Cannot replace transform in unknown predicate: {pred}") + + +def _set_apply_transform( +name: str, pred: BoundSetPredicate[L], transform: Callable[[L], L] +) -> UnboundPredicate[Any]: +literals = pred.literals +if isinstance(pred, BoundSetPredicate): +return pred.as_unbound(Reference(name), {_transform_literal(transform, literal) for literal in literals}) +else: +raise ValueError(f"Unknown BoundSetPredicate: {pred}") Review Comment: Nit: in this case, it isn't a `BoundSetPredicate` because `isinstance` returned false. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection
rdblue commented on code in PR #6128: URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028604963 ## python/pyiceberg/transforms.py: ## @@ -644,8 +751,89 @@ def can_transform(self, _: IcebergType) -> bool: def result_type(self, source: IcebergType) -> IcebergType: return source +def project(self, name: str, pred: BoundPredicate) -> Optional[UnboundPredicate]: +return None + def to_human_string(self, _: IcebergType, value: Optional[S]) -> str: return "null" def __repr__(self) -> str: return "VoidTransform()" + + +M = TypeVar("M") + + +def _truncate_number( +name: str, pred: BoundLiteralPredicate, transform: Callable[[Optional[M]], Optional[M]] +) -> Optional[UnboundPredicate]: +boundary = pred.literal + +if not isinstance(boundary, (LongLiteral, DecimalLiteral, DateLiteral, TimeLiteral, TimestampLiteral)): +raise ValueError(f"Expected a numeric literal, got: {type(boundary)}") + +if isinstance(pred, BoundLessThan): +return LessThanOrEqual(Reference(name), _transform_literal(transform, boundary.decrement())) +elif isinstance(pred, BoundLessThanOrEqual): +return LessThanOrEqual(Reference(name), _transform_literal(transform, boundary)) +elif isinstance(pred, BoundGreaterThan): +return GreaterThanOrEqual(Reference(name), _transform_literal(transform, boundary.increment())) +elif isinstance(pred, BoundGreaterThanOrEqual): +return GreaterThanOrEqual(Reference(name), _transform_literal(transform, boundary)) +elif isinstance(pred, BoundEqualTo): +return EqualTo(Reference(name), _transform_literal(transform, boundary)) +else: +return None + + +def _truncate_array( +name: str, pred: BoundLiteralPredicate, func: Callable[[Optional[M]], Optional[M]] +) -> Optional[UnboundPredicate]: +boundary = pred.literal + +if type(pred) in {BoundLessThan, BoundLessThanOrEqual}: +return LessThanOrEqual(Reference(name), _transform_literal(func, boundary)) +elif type(pred) in {BoundGreaterThan, BoundGreaterThanOrEqual}: +return GreaterThanOrEqual(Reference(name), _transform_literal(func, boundary)) +if isinstance(pred, BoundEqualTo): +return EqualTo(Reference(name), _transform_literal(func, boundary)) +else: +return None + + +def _project_transform_predicate(transform: Transform, partition_name: str, pred: BoundPredicate) -> Optional[UnboundPredicate]: +term = pred.term +if isinstance(term, BoundTransform) and transform == term.transform: +return _remove_transform(partition_name, pred) +return None + + +def _remove_transform(partition_name: str, pred: BoundPredicate): +if isinstance(pred, BoundUnaryPredicate): +return pred.as_unbound(Reference(partition_name)) +elif isinstance(pred, BoundLiteralPredicate): +return pred.as_unbound(Reference(partition_name), pred.literal) +elif isinstance(pred, (BoundIn, BoundNotIn)): +return pred.as_unbound(Reference(partition_name), pred.literals) +else: +raise ValueError(f"Cannot replace transform in unknown predicate: {pred}") + + +def _transform_set(name: str, pred: BoundSetPredicate, func: Callable[[Optional[M]], Optional[M]]) -> UnboundPredicate: +literals = pred.literals +if isinstance(pred, BoundIn): +return In(Reference(name), {_transform_literal(func, literal) for literal in literals}) +elif isinstance(pred, BoundIn): +return NotIn(Reference(name), {_transform_literal(func, literal) for literal in literals}) +else: +raise ValueError(f"Unknown BoundSetPredicate: {pred}") + + +class BoundTransform(BoundTerm[T]): Review Comment: Strange. I don't see an import for expressions. In any case, we can fix it later. 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection
rdblue commented on code in PR #6128: URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028605806 ## python/pyiceberg/transforms.py: ## @@ -511,6 +590,31 @@ def preserves_order(self) -> bool: def source_type(self) -> IcebergType: return self._source_type +def project(self, name: str, pred: BoundPredicate) -> Optional[UnboundPredicate]: +field_type = pred.term.ref().field.field_type + +if isinstance(pred.term, BoundTransform): +return _project_transform_predicate(self, name, pred) + +# Implement startswith and notstartswith for string (and probably binary) +# https://github.com/apache/iceberg/issues/6112 + +if isinstance(pred, BoundUnaryPredicate): +return pred.as_unbound(Reference(name)) + +if isinstance(field_type, (IntegerType, LongType, DecimalType)): +if isinstance(pred, BoundLiteralPredicate): +return _truncate_number(name, pred, self.transform(field_type)) +elif isinstance(pred, BoundIn): +return _transform_set(name, pred, self.transform(field_type)) Review Comment: Yeah, I just meant for the inner `elif` case, where both call `_set_apply_transform`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #6128: Python: Projection
rdblue commented on PR #6128: URL: https://github.com/apache/iceberg/pull/6128#issuecomment-1322801754 I fixed a formatting issue and tests are passing! Overall this looks great with only a couple nits left, so I'll merge it. Thanks for getting this in, @Fokko! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #6128: Python: Projection
rdblue merged PR #6128: URL: https://github.com/apache/iceberg/pull/6128 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #6069: Python: TableScan Plan files API implementation without residual evaluation
rdblue commented on PR #6069: URL: https://github.com/apache/iceberg/pull/6069#issuecomment-1322809905 @dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add `MetricsEvalVisitor` to prune by column stats, (5) add `ResidualEvalVisitor` to produce residuals (LOW priority), and (8) expose `plan_files` via the CLI. For the last one, we may just want to add filters to the existing `files` command? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] dhruv-pratap commented on pull request #6069: Python: TableScan Plan files API implementation without residual evaluation
dhruv-pratap commented on PR #6069: URL: https://github.com/apache/iceberg/pull/6069#issuecomment-1322841113 > @dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add `MetricsEvalVisitor` to prune by column stats, (5) add `ResidualEvalVisitor` to produce residuals (LOW priority), and (8) expose `plan_files` via the CLI. > > For the last one, we may just want to add filters to the existing `files` command? @rdblue We discussed this more in [this](https://github.com/apache/iceberg/pull/6069#discussion_r1012044761) thread. We can extend `files` with three options (a hypothetical CLI sketch follows this message):
* `files --snapshot=all` - The current files behavior
* `files --snapshot=current` - Files under the current snapshot
* `files --snapshot=<snapshot_id>` - Files under the specified snapshot id

Adding an `expression` option would be a bigger undertaking though, as it would require an expression parser. Not sure if we want to prioritize that at this point for the CLI. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
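A hypothetical sketch of that `--snapshot` option using click, which the pyiceberg CLI is built on; the flag name and behavior here are illustrative, not a merged interface.

```python
# Hypothetical sketch only: names and output are illustrative.
import click


@click.command()
@click.option("--snapshot", default="all", help="'all', 'current', or a snapshot id")
def files(snapshot: str) -> None:
    if snapshot == "all":
        click.echo("list files across all snapshots (current behavior)")
    elif snapshot == "current":
        click.echo("list files under the current snapshot")
    else:
        click.echo(f"list files under snapshot {int(snapshot)}")


if __name__ == "__main__":
    files()  # e.g. `python files.py --snapshot=current`
```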
[GitHub] [iceberg] emkornfield commented on issue #644: Views on top of Iceberg tables
emkornfield commented on issue #644: URL: https://github.com/apache/iceberg/issues/644#issuecomment-1322849306 What is the current state of Views, in general, it seems like there has been development effort here but I didn't see a vote on the mailing list officially adopting the specification (could just be my poor search abilities). Is the specification formally ratified? If not, what are the milestones before a vote happens? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] luoyuxia commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)
luoyuxia commented on issue #6104: URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1322899747 > This means that for CDC data streaming to Iceberg, there is no viable compaction scheme at this stage for data streams that contain delete operations? Yes, I'm afraid so. > Here's a question: is it possible to pause the writer for one round of compaction, resume writing from the checkpoint once the compaction is completed, and handle the above commit exception by cyclically suspending, compacting, and writing again? I think it's possible. From the code, IIUC, the exception happens when you start a compaction but it finds delete files before the compaction finishes. As long as no delete files are produced between the start and finish of a compaction, no exception should be thrown. But I'm not sure; you can give it a try and see whether it works. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] hameizi commented on pull request #6237: Core: Fix check is delete file and data file overlap
hameizi commented on PR #6237: URL: https://github.com/apache/iceberg/pull/6237#issuecomment-1322908600 > I think the current judgment has already dealt with this situation. IIUC, deletes and data will **not overlap** if: `(dataLower > deleteUpper) || (deleteLower > dataUpper)` So they **will overlap** when `!((dataLower > deleteUpper) || (deleteLower > dataUpper))` -> `(dataLower <= deleteUpper) && (deleteLower <= dataUpper)`, which is exactly the current judgment. You are right, I will close this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
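The rewrite above is De Morgan's law; a tiny standalone check with plain integer bounds, purely for illustration:

```python
# Negating "the ranges are disjoint" yields exactly the overlap predicate
# quoted above: not (a or b) == (not a) and (not b).
def disjoint(data_lower: int, data_upper: int, delete_lower: int, delete_upper: int) -> bool:
    return data_lower > delete_upper or delete_lower > data_upper


def overlaps(data_lower: int, data_upper: int, delete_lower: int, delete_upper: int) -> bool:
    return data_lower <= delete_upper and delete_lower <= data_upper


for bounds in [(1, 5, 4, 9), (1, 3, 4, 9), (4, 9, 1, 5)]:
    assert overlaps(*bounds) == (not disjoint(*bounds))
```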
[GitHub] [iceberg] hameizi closed pull request #6237: Core: Fix check is delete file and data file overlap
hameizi closed pull request #6237: Core: Fix check is delete file and data file overlap URL: https://github.com/apache/iceberg/pull/6237 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] SHuixo commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)
SHuixo commented on issue #6104: URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1322917998 Yes, we tried it and found it worked. The following two log files are the results of the compaction test. > [compact-data-when-stream-write.log](https://github.com/apache/iceberg/files/10057144/compact-data-when-stream-write.log) > [compact-data-when-write-finished.log](https://github.com/apache/iceberg/files/10057146/compact-data-when-write-finish.log) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] SHuixo closed issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)
SHuixo closed issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model) URL: https://github.com/apache/iceberg/issues/6104 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6239: Docs: Select the right Spark catalog
ajantha-bhat commented on code in PR #6239: URL: https://github.com/apache/iceberg/pull/6239#discussion_r1028370385 ## docs/aws.md: ## @@ -68,6 +68,7 @@ done # start Spark SQL client shell spark-sql --packages $DEPENDENCIES \ +--conf spark.sql.defaultCatalog=my_catalog \ Review Comment: In the same file there are multiple `spark-sql` commands; do we need to add this to all of them, and also check the other files? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6217: Core: Fix deletion of old metadata files when METADATA_DELETE_AFTER_COMMIT_ENABLED is set
amogh-jahagirdar commented on code in PR #6217: URL: https://github.com/apache/iceberg/pull/6217#discussion_r1028798630 ## core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java: ## @@ -414,16 +414,15 @@ private void deleteRemovedMetadataFiles(TableMetadata base, TableMetadata metada TableProperties.METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT); if (deleteAfterCommit) { - Set<TableMetadata.MetadataLogEntry> removedPreviousMetadataFiles = - Sets.newHashSet(base.previousFiles()); - removedPreviousMetadataFiles.removeAll(metadata.previousFiles()); - Tasks.foreach(removedPreviousMetadataFiles) + Set<TableMetadata.MetadataLogEntry> filesToDelete = Sets.newHashSet(base.previousFiles()); + filesToDelete.addAll(metadata.previousFiles()); + Tasks.foreach(filesToDelete) .noRetry() .suppressFailureWhenFinished() .onFailure( - (previousMetadataFile, exc) -> + (metadataFileToDelete, exc) -> LOG.warn( - "Delete failed for previous metadata file: {}", previousMetadataFile, exc)) + "Delete failed for previous metadata file: {}", metadataFileToDelete, exc)) .run(previousMetadataFile -> io().deleteFile(previousMetadataFile.file())); Review Comment: Should this use batch delete if applicable? I don't think it's a blocker though, something we can get in a follow-up if it's too intrusive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] arunb2w opened a new issue, #6244: Iceberg metadata not stored properly
arunb2w opened a new issue, #6244: URL: https://github.com/apache/iceberg/issues/6244

I tried creating a sample Iceberg table with the schema below:
```
CREATE TABLE glue_dev.db.datatype_test (
    id bigint,
    data string,
    category string)
USING iceberg
TBLPROPERTIES (
    'read.split.target-size'='134217728',
    'write.metadata.metrics.default'='full')
```
Then I inserted around 100 records and ran a rewrite of the data files afterwards, so that all the inserted data would be rewritten into a single file:
```
for num in range(1, 100):
    spark.sql(f"INSERT INTO glue_dev.db.datatype_test VALUES ({num}, 'data{num}', 'catagory{num}')")
```
After that I queried the data_files metadata table:
```
select * from glue_dev.db.datatype_test.data_files limit 10;

content             0
file_path           s3://bucket/folder/db.db/datatype_test/data/0-0-4bb9ce80-c7a7-4192-98c6-ed6e7289a981-1.parquet
file_format         PARQUET
spec_id             0
record_count        559
file_size_in_bytes  4234
column_sizes        {1:991,2:1188,3:1258}
value_counts        {1:559,2:559,3:559}
null_value_counts   {1:0,2:0,3:0}
nan_value_counts    {}
lower_bounds        {1:,2:data1,3:catagory1}
upper_bounds        {1:/,2:data99,3:catagory99}
key_metadata        NULL
split_offsets       [4]
equality_ids        NULL
sort_order_id       0

Time taken: 15.614 seconds, Fetched 1 row(s)
```
In this output, the lower_bounds and upper_bounds values for the id column, which is of type bigint, do not represent the correct values. Does that mean Iceberg is not storing the metadata correctly? In that case, if I join on the id column, how will Iceberg properly scan/skip files, since the lower and upper bound metadata is not correct? This is test data, but I have actual data with around 1 files, and they too exhibit similar behaviour for integer columns.
[GitHub] [iceberg] lirui-apache commented on pull request #5206: Core: Defer reading Avro metadata until ManifestFile is read
lirui-apache commented on PR #5206: URL: https://github.com/apache/iceberg/pull/5206#issuecomment-1323119574

We encountered authorization failures reading manifest files after applying this PR, and thought it might be related. Since the worker pool in use is by default a global static pool, the threads in it may not see the correct UGI of the user submitting the operation, and hit permission-denied errors when trying to open the stream. I wonder whether we should call `doAs` in the pool, or let different users use separate pools?
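A minimal sketch of the `doAs` option, assuming Hadoop's `UserGroupInformation` API (the wrapper class below is hypothetical): capture the submitter's UGI when the task is created, and re-enter it on the pooled thread.

```
import java.security.PrivilegedExceptionAction;
import java.util.concurrent.Callable;
import org.apache.hadoop.security.UserGroupInformation;

class UgiAwareTasks {
  // Sketch only: the shared worker-pool thread runs the task (e.g. opening a
  // manifest stream) as the submitting user instead of the process user.
  static <T> Callable<T> wrapWithCallerUgi(Callable<T> task) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    return () -> ugi.doAs((PrivilegedExceptionAction<T>) task::call);
  }
}
```

Separate pools per user would avoid the wrapping, but give up the bound on total threads that a single shared pool provides.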
[GitHub] [iceberg] pvary commented on issue #6241: Updating the HiveCatalog Conf with setConf does not also reset FileIO
pvary commented on issue #6241: URL: https://github.com/apache/iceberg/issues/6241#issuecomment-1323146232

With the current `Catalog` implementations it is probably better to recreate the catalogs anyway, as they are mostly a thin wrapper around the HMSClientPool.
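A minimal sketch of that recreate-instead-of-mutate approach, using `CatalogUtil.loadCatalog` (the catalog name and property map are placeholders):

```
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

class CatalogReload {
  // Sketch only: instead of calling setConf on a live HiveCatalog, build a
  // fresh catalog from the updated Configuration so that FileIO and the
  // client pool are re-initialized together.
  static Catalog reload(Map<String, String> props, Configuration updatedConf) {
    return CatalogUtil.loadCatalog(
        "org.apache.iceberg.hive.HiveCatalog", "hive", props, updatedConf);
  }
}
```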
[GitHub] [iceberg] nastra commented on issue #6003: Vectorized Read
nastra commented on issue #6003: URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1323178266

This problem is specific to Arrow itself when running with JDK 9+, because Arrow's `MemoryUtil` requires access to the mentioned Java module. See https://arrow.apache.org/docs/java/install.html#java-compatibility and https://github.com/apache/arrow/commit/0310add1776054a67aacfb1a7bc0f569a19c7628#diff-23f1adfa46ef0fe69d6902b25cd417114b089d692d0c313d3a9a2c8ceff7cd3d for how to resolve this.
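For reference, and as an assumption drawn from the linked Arrow docs rather than anything verified in this thread, the usual workaround is to open `java.base/java.nio` to unnamed modules when launching the JVM, e.g. via Spark's extra Java options:

```
spark-sql \
  --conf "spark.driver.extraJavaOptions=--add-opens=java.base/java.nio=ALL-UNNAMED" \
  --conf "spark.executor.extraJavaOptions=--add-opens=java.base/java.nio=ALL-UNNAMED"
```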
[GitHub] [iceberg] psnilesh opened a new issue, #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases
psnilesh opened a new issue, #6245: URL: https://github.com/apache/iceberg/issues/6245

### Apache Iceberg version

1.0.0 (latest release)

### Query engine

_No response_

### Please describe the bug

The issue is the same as https://github.com/apache/iceberg/issues/2796. I see this was already fixed by @jfz and merged into `master`, but I don't see the fix in either the `0.14.1` or `1.0.0` branch. Is it a miss, or was it deliberately not included for some reason?
[GitHub] [iceberg] nastra commented on pull request #6169: AWS,Core: Add S3 REST Signer client + REST Spec
nastra commented on PR #6169: URL: https://github.com/apache/iceberg/pull/6169#issuecomment-1323218731 I have moved the S3 Signer REST Spec to the `iceberg-aws` module.
[GitHub] [iceberg] ajantha-bhat commented on issue #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases
ajantha-bhat commented on issue #6245: URL: https://github.com/apache/iceberg/issues/6245#issuecomment-1323226313 The 0.14.1 and 1.0.0 releases are not from the master branch; they were based on the 0.14.0 branch. The fix you are expecting will be available in the upcoming 1.1.0 release.
[GitHub] [iceberg] nastra commented on pull request #6240: Nessie: Refactor NessieTableOperations#doCommit
nastra commented on PR #6240: URL: https://github.com/apache/iceberg/pull/6240#issuecomment-1323227993 cc: @dimas-b / @snazy can you guys review this one please?
[GitHub] [iceberg] psnilesh commented on issue #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases
psnilesh commented on issue #6245: URL: https://github.com/apache/iceberg/issues/6245#issuecomment-1323228826 Thank you. When can I expect 1.1.0?
[GitHub] [iceberg] ajantha-bhat commented on issue #6244: Iceberg metadata not stored properly
ajantha-bhat commented on issue #6244: URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1323230663

> Does that mean iceberg is not storing the metadata correctly?

Lower bounds and upper bounds are stored as byte arrays in the manifest files. Hence, what you are seeing is the byte-array string representation, which is not human-readable. During file pruning these values are converted back to the proper data types and compared. We can enhance these metadata tables to show the actual values; I believe @szehon-ho has an open PR for this.
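To illustrate the round trip (a hedged sketch using Iceberg's `Conversions` helper with made-up values, not the reporter's actual bounds):

```
import java.nio.ByteBuffer;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

public class BoundsRoundTrip {
  public static void main(String[] args) {
    // Serialize a bigint value the way single-value bounds are stored.
    ByteBuffer raw = Conversions.toByteBuffer(Types.LongType.get(), 99L);
    // Printed naively, those bytes render as unreadable characters, but
    // decoding with the column's type recovers the real value.
    Object decoded = Conversions.fromByteBuffer(Types.LongType.get(), raw);
    System.out.println(decoded); // 99
  }
}
```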
[GitHub] [iceberg] ajantha-bhat commented on issue #6244: Iceberg metadata not stored properly
ajantha-bhat commented on issue #6244: URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1323232753 As a workaround, I believe https://github.com/hililiwei/iceberg-tools#manifest2json can convert them and show them. cc: @hililiwei
[GitHub] [iceberg] ajantha-bhat commented on issue #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases
ajantha-bhat commented on issue #6245: URL: https://github.com/apache/iceberg/issues/6245#issuecomment-1323233889 The release is already open for voting, so it should be available within 3 to 4 days.
[GitHub] [iceberg] Fokko merged pull request #6211: Allow dropping a column used by old SortOrders but not current SortOrder
Fokko merged PR #6211: URL: https://github.com/apache/iceberg/pull/6211
[GitHub] [iceberg] Fokko commented on issue #6204: Allow dropping a column used by old SortOrders
Fokko commented on issue #6204: URL: https://github.com/apache/iceberg/issues/6204#issuecomment-1323246249 Thanks for double-checking this.
[GitHub] [iceberg] Fokko closed issue #6204: Allow dropping a column used by old SortOrders
Fokko closed issue #6204: Allow dropping a column used by old SortOrders URL: https://github.com/apache/iceberg/issues/6204
[GitHub] [iceberg] ajantha-bhat commented on pull request #6240: Nessie: Refactor NessieTableOperations#doCommit
ajantha-bhat commented on PR #6240: URL: https://github.com/apache/iceberg/pull/6240#issuecomment-1323253673

@nastra: Review from the Iceberg side is easy. But I would like to know: can Trino really use it? It looks to me like `client#commitTable` needs the parameters below.
```
commitTable(
    TableMetadata base,
    TableMetadata metadata,
    String newMetadataLocation,
    IcebergTable expectedContent,
    ContentKey key)
```
But the Trino interface (`NessieIcebergTableOperations#commitNewTable`) currently doesn't have `base` or `expectedContent`, and these interfaces are common to the other catalogs too.