[GitHub] [iceberg] yuangjiang opened a new issue, #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes

2022-11-21 Thread GitBox


yuangjiang opened a new issue, #6236:
URL: https://github.com/apache/iceberg/issues/6236

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Iceberg Spark cannot create a table using the JDBC catalog; it reports that catalog initialization failed:
   
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
	at com.mysql.jdbc.Util.getInstance(Util.java:408)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
	at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
	at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
	at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2495)
	at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1903)
	at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1242)
	at org.apache.iceberg.jdbc.JdbcCatalog.lambda$initializeCatalogTables$1(JdbcCatalog.java:152)
	at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:58)
	at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
	at org.apache.iceberg.jdbc.JdbcCatalog.initializeCatalogTables(JdbcCatalog.java:135)
	at org.apache.iceberg.jdbc.JdbcCatalog.initialize(JdbcCatalog.java:103)
	... 87 more
   org.apache.iceberg.jdbc.UncheckedSQLException: Cannot initialize JDBC catalog
	at org.apache.iceberg.jdbc.JdbcCatalog.initialize(JdbcCatalog.java:109)
	at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:212)
	at org.apache.iceberg.CatalogUtil.buildIcebergCatalog(CatalogUtil.java:254)
	at org.apache.iceberg.spark.SparkCatalog.buildIcebergCatalog(SparkCatalog.java:125)
	at org.apache.iceberg.spark.SparkCatalog.initialize(SparkCatalog.java:470)
	at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:60)
	at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$catalog$1(CatalogManager.scala:52)
	at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
	at org.apache.spark.sql.connector.catalog.CatalogManager.catalog(CatalogManager.scala:52)
	at org.apache.spark.sql.connector.catalog.LookupCatalog$CatalogAndIdentifier$.unapply(LookupCatalog.scala:123)
	at org.apache.spark.sql.connector.catalog.LookupCatalog$NonSessionCatalogAndIdentifier$.unapply(LookupCatalog.scala:73)
	at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$NonSessionCatalogAndTable$.unapply(ResolveCatalogs.scala:99)
	at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:35)
	at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:33)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:76)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:75)
	at org.apache.spark.s

[GitHub] [iceberg] yuangjiang commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes

2022-11-21 Thread GitBox


yuangjiang commented on issue #6236:
URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1321628781

   My submit command is as follows:
   
   bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.0.0 \
     --conf spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.prod.warehouse=file://tmp/iceberg \
     --conf spark.sql.catalog.prod.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
     --conf spark.sql.catalog.prod.uri=jdbc:mysql://localhost:3306/iceberg \
     --conf spark.sql.catalog.prod.jdbc.verifyServerCertificate=false \
     --conf spark.sql.catalog.prod.jdbc.useSSL=false \
     --conf spark.sql.catalog.prod.jdbc.user=iceberg \
     --conf spark.sql.catalog.prod.jdbc.password=password


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] hameizi opened a new pull request, #6237: Core: Fix check is delete file and data file overlap

2022-11-21 Thread GitBox


hameizi opened a new pull request, #6237:
URL: https://github.com/apache/iceberg/pull/6237

   There is no overlap between the delete file and the data file only when both `deleteLower` and `deleteUpper` are less than `dataLower`, or both `deleteLower` and `deleteUpper` are greater than `dataUpper`.
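   The condition described above can be sketched as a small predicate (variable names follow the PR's wording; this is an illustration, not the actual Iceberg code):
   
   ```python
   def no_overlap(delete_lower, delete_upper, data_lower, data_upper):
       # No overlap only when the delete range lies entirely below the data
       # range, or entirely above it.
       below = delete_lower < data_lower and delete_upper < data_lower
       above = delete_lower > data_upper and delete_upper > data_upper
       return below or above
   
   print(no_overlap(1, 3, 5, 9))   # True: delete range [1, 3] is entirely below [5, 9]
   print(no_overlap(4, 6, 5, 9))   # False: ranges [4, 6] and [5, 9] overlap
   ```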





[GitHub] [iceberg] hameizi commented on pull request #6237: Core: Fix check is delete file and data file overlap

2022-11-21 Thread GitBox


hameizi commented on PR #6237:
URL: https://github.com/apache/iceberg/pull/6237#issuecomment-1321652265

   @rdblue can you take a look?





[GitHub] [iceberg] yuangjiang commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes

2022-11-21 Thread GitBox


yuangjiang commented on issue #6236:
URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1321699076

   CREATE TABLE `iceberg_namespace_properties` (
     `catalog_name` varchar(255) NOT NULL,
     `namespace` varchar(255) NOT NULL,
     `property_key` varchar(255) NOT NULL,
     `property_value` varchar(5500) DEFAULT NULL,
     PRIMARY KEY (`catalog_name`,`namespace`,`property_key`)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
   
   Given the modified properties above, is the NAMESPACE_PROPERTY_KEY in the default CREATE_NAMESPACE_PROPERTIES_TABLE statement set too large?
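   As a sanity check on the 3072-byte limit: InnoDB charges utf8mb4 VARCHAR columns at 4 bytes per character, so the byte cost of a composite index can be estimated as below (a rough illustration; the exact DDL Iceberg generates may differ):
   
   ```python
   UTF8MB4_BYTES_PER_CHAR = 4  # InnoDB assumes the worst case for utf8mb4
   INNODB_MAX_KEY_BYTES = 3072
   
   def key_bytes(varchar_lengths):
       """Estimated index size of a composite key over utf8mb4 VARCHAR columns."""
       return sum(n * UTF8MB4_BYTES_PER_CHAR for n in varchar_lengths)
   
   # The modified DDL above: three VARCHAR(255) key columns fit under the limit.
   print(key_bytes([255, 255, 255]))  # 3060
   print(key_bytes([255, 255, 255]) <= INNODB_MAX_KEY_BYTES)  # True
   
   # A single wider key column (e.g. VARCHAR(1000)) would already exceed it.
   print(key_bytes([255, 255, 1000]) <= INNODB_MAX_KEY_BYTES)  # False
   ```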





[GitHub] [iceberg] nastra commented on pull request #6238: Nessie: Make UpdateableReference public

2022-11-21 Thread GitBox


nastra commented on PR #6238:
URL: https://github.com/apache/iceberg/pull/6238#issuecomment-1321831628

   I think it would actually be better if `NessieIcebergClient` had a re-usable commit operation rather than exposing this class here. I talked to @ajantha-bhat and he'll add it.





[GitHub] [iceberg] nastra closed pull request #6238: Nessie: Make UpdateableReference public

2022-11-21 Thread GitBox


nastra closed pull request #6238: Nessie: Make UpdateableReference public
URL: https://github.com/apache/iceberg/pull/6238





[GitHub] [iceberg] Fokko merged pull request #6212: Replace ImmutableMap.Builder.build() with buildOrThrow()

2022-11-21 Thread GitBox


Fokko merged PR #6212:
URL: https://github.com/apache/iceberg/pull/6212





[GitHub] [iceberg] Fokko merged pull request #6228: Python: Minor fixes to expression types

2022-11-21 Thread GitBox


Fokko merged PR #6228:
URL: https://github.com/apache/iceberg/pull/6228





[GitHub] [iceberg] zhongyujiang commented on pull request #6237: Core: Fix check is delete file and data file overlap

2022-11-21 Thread GitBox


zhongyujiang commented on PR #6237:
URL: https://github.com/apache/iceberg/pull/6237#issuecomment-1321994998

   I think the current judgment already deals with this situation. IIUC, deletes and data will **not overlap** if:
   `(dataLower > deleteUpper) || (deleteLower > dataUpper)`
   So they **will overlap** when
   `!((dataLower > deleteUpper) || (deleteLower > dataUpper))`
   ->
   `(dataLower <= deleteUpper) && (deleteLower <= dataUpper)`
   which is exactly the current judgment.
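   The equivalence claimed above is just De Morgan's law; it can be verified exhaustively over small integer ranges (a quick check, not project code):
   
   ```python
   vals = range(4)
   for data_lower in vals:
       for data_upper in vals:
           for delete_lower in vals:
               for delete_upper in vals:
                   # negation of the no-overlap condition ...
                   lhs = not ((data_lower > delete_upper) or (delete_lower > data_upper))
                   # ... equals the overlap condition used in the current judgment
                   rhs = (data_lower <= delete_upper) and (delete_lower <= data_upper)
                   assert lhs == rhs
   print("equivalence holds")
   ```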





[GitHub] [iceberg] pvary commented on pull request #6175: Hive: Add UGI to the key in CachedClientPool

2022-11-21 Thread GitBox


pvary commented on PR #6175:
URL: https://github.com/apache/iceberg/pull/6175#issuecomment-1322002592

   @flyrain: The Catalog is a strange animal. We basically expect it to be short-lived and easy/cheap to drop and recreate. I had a similar discussion [1] on another PR around this. In the Hive code we definitely create Catalog objects considering the costs negligible.
   
   I feel that we bump into the issue of closing/keeping catalogs more and more, and the current behavior is somewhat non-intuitive, but changing it would have a serious impact, so we should consider it carefully.
   
   In the meantime, I would support the idea of the configurable cache.
   
   [1] https://github.com/apache/iceberg/pull/5166
   





[GitHub] [iceberg] Fokko commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


Fokko commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1027642501


##
python/pyiceberg/table/__init__.py:
##
@@ -138,7 +157,10 @@ def __eq__(self, other: Any) -> bool:
         )
 
 
-class TableScan:
+S = TypeVar("S", bound="TableScan", covariant=True)

Review Comment:
   Instead of introducing a generic, we can also return `Self`, which feels more Pythonic to me: https://peps.python.org/pep-0673/
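For context, the two typing styles under discussion look roughly like this (a minimal sketch with hypothetical class names; `Self` needs Python 3.11+ or `typing_extensions`):

```python
from typing import TypeVar

S = TypeVar("S", bound="Scan")

class Scan:
    # TypeVar style: annotate self so S is bound to the calling subclass,
    # making DataScan().update() type as DataScan rather than Scan.
    def update(self: S) -> S:
        return self.__class__()

class DataScan(Scan):
    pass

# PEP 673 style (3.11+), equivalent but without the module-level TypeVar:
#     from typing import Self
#     class Scan:
#         def update(self) -> Self: ...

assert type(DataScan().update()) is DataScan
```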



##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)

Review Comment:
   Why the `dataclass` here? Should we make it at least frozen?
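To illustrate the trade-off being asked about (a toy example, not the pyiceberg class): `@dataclass(init=False)` still generates `__repr__` and `__eq__` even when `__init__` is hand-written, while `frozen=True` would additionally block attribute assignment after construction.

```python
from dataclasses import dataclass

@dataclass(init=False)
class Task:
    start: int
    length: int

    def __init__(self, start=None, length=None):
        # hand-written __init__ so defaults can be derived from other values
        self.start = start or 0
        self.length = length or 100

t = Task()
print(t)     # Task(start=0, length=100)  <- __repr__ generated for free
t.start = 5  # allowed; with frozen=True this would raise FrozenInstanceError
```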



##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]
+    wrapped: dict[str, Any]
+
+    def __init__(self, partition_type: StructType):
+        self.pos_to_name = {}
+        for pos, field in enumerate(partition_type.fields):
+            self.pos_to_name[pos] = field.name
+
+    def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+        self.wrapped = to_wrap
+        return self
+
+    def get(self, pos: int) -> Any:
+        return self.wrapped[self.pos_to_name[pos]]
+
+    def set(self, pos: int, value: Any) -> None:
+        raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression] = None,
+        partition_filter: Optional[BooleanExpression] = None,
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+    def plan_files(self) -> Iterator[ScanTask]:
+        snapshot = self.snapshot()
+        if not snapshot:
+            return ()
+
+        io = self.table.io
+
+        # step 1: filter manifests using partition summaries
+        # the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id
+
+        @cache
+        def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:

Review Comment:

[GitHub] [iceberg] SHuixo commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)

2022-11-21 Thread GitBox


SHuixo commented on issue #6104:
URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1322093088

   > Yes, we have to wait for it to be merged.
   
   Good, looking forward to the merge of this new RocksDB feature.
   
   > Had a look at your exception log. The reason is that the CDC stream contains a delete row, and compaction for cases that contain delete files isn't supported yet.
   
   Does this mean that, for CDC data streamed to Iceberg, there is currently no viable compaction scheme for data streams that contain delete operations?
   
   To verify whether the **commit exception** seen in the streaming scenario also occurs in the batch scenario, we made the following attempts:
   
   ```
   1. Enable Flink CDC streaming writes to Iceberg, with a 5-minute checkpoint interval;
   2. While the writer is running, start the compaction program until a **commit exception** occurs;
   3. When the above **commit exception** occurs, stop both the CDC write program and the compaction program;
   4. Start the compaction program again until it is up and running.
   ```
   
   > The following figure shows the start and end times when Flink CDC writes data to Iceberg:
   
   https://user-images.githubusercontent.com/20868410/203070483-d39c3107-ca61-4d31-bb30-7d63cf821697.PNG
   
   > Some of the logs are as follows:
   
   [compact-data-when-stream-write.log](https://github.com/apache/iceberg/files/10056982/compact-data-when-stream-write.log)
   [compact-data-when-write-finish.log](https://github.com/apache/iceberg/files/10056985/compact-data-when-write-finish.log)
   
   Here's a question: is it possible to pause the writer for one round of compaction, resume writing from the checkpoint once compaction completes, and handle the above commit exception by cyclically suspending, compacting, and writing again?
   





[GitHub] [iceberg] Fokko commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


Fokko commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028076839


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]
+    wrapped: dict[str, Any]
+
+    def __init__(self, partition_type: StructType):
+        self.pos_to_name = {}
+        for pos, field in enumerate(partition_type.fields):
+            self.pos_to_name[pos] = field.name
+
+    def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+        self.wrapped = to_wrap
+        return self
+
+    def get(self, pos: int) -> Any:
+        return self.wrapped[self.pos_to_name[pos]]
+
+    def set(self, pos: int, value: Any) -> None:
+        raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression] = None,
+        partition_filter: Optional[BooleanExpression] = None,
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+    def plan_files(self) -> Iterator[ScanTask]:
+        snapshot = self.snapshot()
+        if not snapshot:
+            return ()
+
+        io = self.table.io
+
+        # step 1: filter manifests using partition summaries
+        # the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id
+
+        @cache
+        def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:
+            spec = self.table.specs()[spec_id]
+            return visitors.manifest_evaluator(spec, self.table.schema(), self.partition_filter, self.case_sensitive)
+
+        def partition_summary_filter(manifest_file: ManifestFile) -> bool:

Review Comment:
   Maybe we should also align on this as a community and stick with filter or list comprehension, where I would prefer the latter :)
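The two styles being weighed above give identical results; a minimal side-by-side comparison (illustrative names only):

```python
manifests = ["m1", "m2", "m3"]

def keep(manifest):
    # stand-in for the partition summary predicate
    return manifest != "m2"

via_filter = list(filter(keep, manifests))
via_comprehension = [m for m in manifests if keep(m)]

assert via_filter == via_comprehension == ["m1", "m3"]
print(via_comprehension)  # ['m1', 'm3']
```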






[GitHub] [iceberg] Fokko opened a new pull request, #6239: Docs: Select the right Spark catalog

2022-11-21 Thread GitBox


Fokko opened a new pull request, #6239:
URL: https://github.com/apache/iceberg/pull/6239

   I copy-pasted the example, but still had to select the catalog.





[GitHub] [iceberg] Fokko merged pull request #6034: Python: GlueCatalog Full Implementation

2022-11-21 Thread GitBox


Fokko merged PR #6034:
URL: https://github.com/apache/iceberg/pull/6034





[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


asheeshgarg commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322337039

   @nastra 
   I have removed the Spark dependency and just added the raw dependencies (the XML tags were stripped by the mail archive; reconstructed below):
   
   <dependency>
     <groupId>org.apache.iceberg</groupId>
     <artifactId>iceberg-common</artifactId>
     <version>1.0.0</version>
   </dependency>
   <dependency>
     <groupId>org.apache.iceberg</groupId>
     <artifactId>iceberg-core</artifactId>
     <version>1.0.0</version>
   </dependency>
   <dependency>
     <groupId>org.apache.iceberg</groupId>
     <artifactId>iceberg-data</artifactId>
     <version>1.0.0</version>
   </dependency>
   <dependency>
     <groupId>org.apache.iceberg</groupId>
     <artifactId>iceberg-hive-metastore</artifactId>
     <version>1.0.0</version>
   </dependency>
   <dependency>
     <groupId>org.apache.iceberg</groupId>
     <artifactId>iceberg-parquet</artifactId>
     <version>1.0.0</version>
   </dependency>
   <dependency>
     <groupId>org.apache.iceberg</groupId>
     <artifactId>iceberg-arrow</artifactId>
     <version>1.0.0</version>
   </dependency>
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-common</artifactId>
     <version>3.3.1</version>
   </dependency>
   
   I am getting java.lang.NoClassDefFoundError: org/apache/thrift/TException. Do we have a list of dependencies that we need to set in order to read the data files from Iceberg with the pure Java API, without any processing engine?





[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028276873


##
python/pyiceberg/table/__init__.py:
##
@@ -138,7 +157,10 @@ def __eq__(self, other: Any) -> bool:
         )
 
 
-class TableScan:
+S = TypeVar("S", bound="TableScan", covariant=True)

Review Comment:
   Nice!






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028278238


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)

Review Comment:
   This was to get repr for free. I didn't make it frozen because we reuse it to wrap each partition tuple.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028282275


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
 raise ValueError(f"Cannot scan unknown ref={name}")
 
-def select(self, *field_names: str) -> TableScan:
+def select(self, *field_names: str) -> S:
 if "*" in self.selected_fields:
 return self.update(selected_fields=field_names)
 return 
self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names
 
-def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+def filter_rows(self, new_row_filter: BooleanExpression) -> S:
 return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
 return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+def with_case_sensitive(self, case_sensitive: bool = True) -> S:
 return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+file: DataFile
+start: int
+length: int
+
+def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+self.file = data_file

Review Comment:
   This is `file` in the JVM version, which is why I went with that. The JVM 
one is way more complicated than we need it to be, so I'll have to think about 
how we want to structure the task classes.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028282699


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
 raise ValueError(f"Cannot scan unknown ref={name}")
 
-def select(self, *field_names: str) -> TableScan:
+def select(self, *field_names: str) -> S:
 if "*" in self.selected_fields:
 return self.update(selected_fields=field_names)
 return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+def filter_rows(self, new_row_filter: BooleanExpression) -> S:
 return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
 return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+def with_case_sensitive(self, case_sensitive: bool = True) -> S:
 return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+file: DataFile
+start: int
+length: int
+
+def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+self.file = data_file
+self.start = start or 0
+self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+pos_to_name: dict[int, str]
+wrapped: dict[str, Any]
+
+def __init__(self, partition_type: StructType):
+self.pos_to_name = {}
+for pos, field in enumerate(partition_type.fields):
+self.pos_to_name[pos] = field.name
+
+def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+self.wrapped = to_wrap
+return self
+
+def get(self, pos: int) -> Any:
+return self.wrapped[self.pos_to_name[pos]]
+
+def set(self, pos: int, value: Any) -> None:
+raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+def __init__(
+self,
+table: Table,
+row_filter: Optional[BooleanExpression] = None,
+partition_filter: Optional[BooleanExpression] = None,
+selected_fields: Tuple[str] = ("*",),
+case_sensitive: bool = True,
+snapshot_id: Optional[int] = None,
+options: Properties = EMPTY_DICT,
+):
+super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+def plan_files(self) -> Iterator[ScanTask]:
+snapshot = self.snapshot()
+if not snapshot:
+return ()

Review Comment:
   I wasn't sure how to return an empty generator, so I just returned a tuple. 
Happy to update to something better.
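   For reference, a generator function can be ended early with a bare `return`, which hands callers an empty iterator. A small sketch with hypothetical names (not pyiceberg's actual code):

```python
from typing import Iterator, Optional


def plan_files_empty(snapshot: Optional[object] = None) -> Iterator[str]:
    # the presence of `yield` makes this a generator function, so a bare
    # `return` simply ends iteration and callers receive an empty iterator
    if not snapshot:
        return
    yield "file-scan-task"


assert list(plan_files_empty()) == []
assert list(plan_files_empty(object())) == ["file-scan-task"]
```

   Returning a tuple also works because the caller only iterates the result, but the bare `return` keeps the declared `Iterator` return type honest.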






[GitHub] [iceberg] nastra commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


nastra commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322348315

   Can you try `iceberg-hive-runtime` rather than `iceberg-hive-metastore` 
(you'd also need to remove `iceberg-parquet` as that's shaded inside 
`iceberg-hive-runtime`)?





[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028283685


##
python/pyiceberg/manifest.py:
##
@@ -141,6 +141,14 @@ def read_manifest_entry(input_file: InputFile) -> Iterator[ManifestEntry]:
 yield ManifestEntry(**dict_repr)
 
 
+def live_entries(input_file: InputFile) -> Iterator[ManifestEntry]:
+return filter(lambda entry: entry.status != ManifestEntryStatus.DELETED, read_manifest_entry(input_file))

Review Comment:
   I think we want the planning path to use a generator.
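   For illustration, the built-in `filter` is already lazy, but the same logic as a generator expression (with hypothetical stand-in classes for `ManifestEntry` and `ManifestEntryStatus`) reads like:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator


class Status(Enum):  # hypothetical stand-in for ManifestEntryStatus
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass
class Entry:  # hypothetical stand-in for ManifestEntry
    status: Status


def live_entries(entries: Iterator[Entry]) -> Iterator[Entry]:
    # a generator expression filters lazily, one entry at a time,
    # so the planning path never materializes the full manifest in memory
    return (e for e in entries if e.status != Status.DELETED)


entries = [Entry(Status.ADDED), Entry(Status.DELETED), Entry(Status.EXISTING)]
assert [e.status for e in live_entries(iter(entries))] == [Status.ADDED, Status.EXISTING]
```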






[GitHub] [iceberg] rdblue commented on a diff in pull request #6232: Python: Disallow Any generics

2022-11-21 Thread GitBox


rdblue commented on code in PR #6232:
URL: https://github.com/apache/iceberg/pull/6232#discussion_r1028308314


##
python/pyiceberg/expressions/literals.py:
##
@@ -207,7 +207,7 @@ def __init__(self, value: int):
 super().__init__(value, int)
 
 @singledispatchmethod
-def to(self, type_var: IcebergType) -> Literal:
+def to(self, type_var: IcebergType) -> Literal:  # type: ignore

Review Comment:
   Is there no way to return an unknown type from a method in Python? In Java, we would just declare that the method returns `Literal<T>` where `T` comes from context. Then you get a type that may not be correct, but is at least not `<Object>`.






[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


asheeshgarg commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322426914

   @nastra 
 
 <dependency>
   <groupId>org.apache.iceberg</groupId>
   <artifactId>iceberg-common</artifactId>
   <version>1.0.0</version>
 </dependency>
 <dependency>
   <groupId>org.apache.iceberg</groupId>
   <artifactId>iceberg-core</artifactId>
   <version>1.0.0</version>
 </dependency>
 <dependency>
   <groupId>org.apache.iceberg</groupId>
   <artifactId>iceberg-data</artifactId>
   <version>1.0.0</version>
 </dependency>
 <dependency>
   <groupId>org.apache.iceberg</groupId>
   <artifactId>iceberg-arrow</artifactId>
   <version>1.0.0</version>
 </dependency>
 <dependency>
   <groupId>org.apache.iceberg</groupId>
   <artifactId>iceberg-hive-runtime</artifactId>
   <version>1.0.0</version>
 </dependency>
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-common</artifactId>
   <version>3.3.1</version>
 </dependency>
   
   still getting 
   java.lang.NoClassDefFoundError: org/apache/thrift/TException 





[GitHub] [iceberg] ajantha-bhat opened a new pull request, #6240: Nessie: Refactor NessieTableOperations#doCommit

2022-11-21 Thread GitBox


ajantha-bhat opened a new pull request, #6240:
URL: https://github.com/apache/iceberg/pull/6240

   Move the core logic from `NessieTableOperations#doCommit` to `NessieIcebergClient#commitTable`, because the Trino Nessie catalog integration (https://github.com/trinodb/trino/pull/11701) doesn't use Iceberg's `NessieTableOperations` directly.
   Hence, to avoid code duplication, move the common logic to `NessieIcebergClient`.





[GitHub] [iceberg] ajantha-bhat commented on pull request #6240: Nessie: Refactor NessieTableOperations#doCommit

2022-11-21 Thread GitBox


ajantha-bhat commented on PR #6240:
URL: https://github.com/apache/iceberg/pull/6240#issuecomment-1322441438

   cc: @nastra 





[GitHub] [iceberg] nastra commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


nastra commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322441996

   NoClassDefFoundError generally means that this dependency existed at 
compilation time but doesn't exist at runtime, so you're missing the right 
dependency for that. 
   Can you provide a full stack trace and all dependencies that you're using? 
Or is the above list complete?
   Or do you have a direct link to this project where this issue happens?





[GitHub] [iceberg] nastra commented on a diff in pull request #6240: Nessie: Refactor NessieTableOperations#doCommit

2022-11-21 Thread GitBox


nastra commented on code in PR #6240:
URL: https://github.com/apache/iceberg/pull/6240#discussion_r1028365461


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -46,15 +50,7 @@
 import org.projectnessie.error.NessieNamespaceNotFoundException;
 import org.projectnessie.error.NessieNotFoundException;
 import org.projectnessie.error.NessieReferenceNotFoundException;
-import org.projectnessie.model.Branch;
-import org.projectnessie.model.Content;
-import org.projectnessie.model.ContentKey;
-import org.projectnessie.model.EntriesResponse;
-import org.projectnessie.model.GetNamespacesResponse;
-import org.projectnessie.model.IcebergTable;
-import org.projectnessie.model.Operation;
-import org.projectnessie.model.Reference;
-import org.projectnessie.model.Tag;
+import org.projectnessie.model.*;

Review Comment:
   I don't think we allow * imports in the project






[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


asheeshgarg commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322444351

   @nastra is there a complete list of dependencies that I can use for a pure Java API program?
   The above list is complete; I have added
 <dependency>
   <groupId>org.apache.thrift</groupId>
   <artifactId>libthrift</artifactId>
   <version>0.16.0</version>
 </dependency>
   Now getting java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.metastore.api.UnknownDBException





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6239: Docs: Select the right Spark catalog

2022-11-21 Thread GitBox


ajantha-bhat commented on code in PR #6239:
URL: https://github.com/apache/iceberg/pull/6239#discussion_r1028370385


##
docs/aws.md:
##
@@ -68,6 +68,7 @@ done
 
 # start Spark SQL client shell
 spark-sql --packages $DEPENDENCIES \
+--conf spark.sql.defaultCatalog=my_catalog \

Review Comment:
   In the same file there are multiple `spark-sql` commands; we need to add this to all of them, and also check the other files? :)






[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6240: Nessie: Refactor NessieTableOperations#doCommit

2022-11-21 Thread GitBox


ajantha-bhat commented on code in PR #6240:
URL: https://github.com/apache/iceberg/pull/6240#discussion_r1028371605


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -46,15 +50,7 @@
 import org.projectnessie.error.NessieNamespaceNotFoundException;
 import org.projectnessie.error.NessieNotFoundException;
 import org.projectnessie.error.NessieReferenceNotFoundException;
-import org.projectnessie.model.Branch;
-import org.projectnessie.model.Content;
-import org.projectnessie.model.ContentKey;
-import org.projectnessie.model.EntriesResponse;
-import org.projectnessie.model.GetNamespacesResponse;
-import org.projectnessie.model.IcebergTable;
-import org.projectnessie.model.Operation;
-import org.projectnessie.model.Reference;
-import org.projectnessie.model.Tag;
+import org.projectnessie.model.*;

Review Comment:
   It is draft PR. My bad. Let me clean it up.






[GitHub] [iceberg] mderoy opened a new issue, #6241: Updating the HiveCatalog Conf with setConf does not also reset FileIO

2022-11-21 Thread GitBox


mderoy opened a new issue, #6241:
URL: https://github.com/apache/iceberg/issues/6241

   ### Apache Iceberg version
   
   0.14.0
   
   ### Query engine
   
   _No response_
   
   ### Please describe the bug 🐞
   
   The Iceberg HiveCatalog class lets you set a configuration via
   ```
 @Override
 public void setConf(Configuration conf) {
   this.conf = new Configuration(conf);
 }
   ```
   however doing this does not actually update the things in the class which depend on it, like the FileIO... our application was working properly until we tried to remove the S3 credential environment variables and switch to passing them programmatically as part of the HiveConf... unfortunately, though, we did this
   ```
m_catalog.initialize("hive", properties);
m_catalog.setConf(config);
   ```
   instead of 
   ```
m_catalog.setConf(config);
m_catalog.initialize("hive", properties);
   ```
   which led to the error `No AWS Credentials provided by 
TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider 
EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider`
   
   Working my way back through the stack, I realized that the Conf being passed was blank, which is why the credentials weren't being set up. Reading the source, I realized that a HadoopFileIO was set using a blank Conf during initialize(), even though our class had the proper Conf set afterwards.

   To avoid such issues, I'd prefer if setConf was removed and instead replaced with a constructor which accepted the Configuration (and a constructor without it that creates a new blank Conf)... or maybe have setConf throw an error if it's already set... or maybe even have setConf call initialize() so these things use the proper Conf.
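   One of the suggested fixes (have setConf fail fast once initialize() has run) can be sketched generically. This is a hypothetical Python mimic of the proposed guard, not Iceberg's actual HiveCatalog:

```python
class GuardedCatalog:
    """Hypothetical catalog that rejects configuration after initialization."""

    def __init__(self):
        self._conf = None
        self._initialized = False

    def set_conf(self, conf):
        if self._initialized:
            # dependent objects (e.g. a FileIO) were already built from the old
            # conf, so silently accepting a new one would be misleading
            raise RuntimeError("setConf called after initialize(); conf would be ignored")
        self._conf = dict(conf)

    def initialize(self, name, properties):
        # after this point the conf is baked into dependent objects
        self._initialized = True


catalog = GuardedCatalog()
catalog.set_conf({"fs.s3a.access.key": "example"})  # set conf BEFORE initialize()
catalog.initialize("hive", {})
```

   With this guard, the reversed call order in the report would fail immediately with a clear message instead of surfacing later as missing AWS credentials.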





[GitHub] [iceberg] mderoy commented on issue #6241: Updating the HiveCatalog Conf with setConf does not also reset FileIO

2022-11-21 Thread GitBox


mderoy commented on issue #6241:
URL: https://github.com/apache/iceberg/issues/6241#issuecomment-1322538070

   I understand this is mostly "user error", but I needed to read the source to get to the root of the problem, which was not obvious. I'd like it if the interface could enforce this... but in the worst case, even if the issue gets closed, the error message is attached to the ticket and is now searchable for anyone else who may struggle with this for a few days.





[GitHub] [iceberg] Fokko opened a new pull request, #6242: API: Restore the type of the identity transform

2022-11-21 Thread GitBox


Fokko opened a new pull request, #6242:
URL: https://github.com/apache/iceberg/pull/6242

   This caused some regression for the Iceberg 1.1.0 release:
   
   ```
   2022-11-21T12:05:46.6549795Z [ERROR] 
io.trino.plugin.iceberg.TestIcebergSystemTables.testManifestsTable  Time 
elapsed: 0.701 s  <<< FAILURE!
   2022-11-21T12:05:46.6550853Z java.lang.AssertionError:
   2022-11-21T12:05:46.6551986Z [Rows for query [SELECT added_data_files_count, 
existing_rows_count, added_rows_count, deleted_data_files_count, 
deleted_rows_count, partitions FROM test_schema."test_table$manifests"]]
   2022-11-21T12:05:46.6553075Z Expecting:
   2022-11-21T12:05:46.6553593Z   <(2, 0, 3, 0, 0, [[false, false, 18148, 
18149]]), (2, 0, 3, 0, 0, [[false, false, 18147, 18148]])>
   2022-11-21T12:05:46.6553980Z to contain exactly in any order:
   2022-11-21T12:05:46.6554557Z   <[(2, 0, 3, 0, 0, [[false, false, 2019-09-08, 
2019-09-09]]),
   2022-11-21T12:05:46.6554992Z (2, 0, 3, 0, 0, [[false, false, 2019-09-09, 
2019-09-10]])]>
   2022-11-21T12:05:46.6555273Z elements not found:
   2022-11-21T12:05:46.6555804Z   <(2, 0, 3, 0, 0, [[false, false, 2019-09-08, 
2019-09-09]]), (2, 0, 3, 0, 0, [[false, false, 2019-09-09, 2019-09-10]])>
   2022-11-21T12:05:46.6556132Z and elements not expected:
   2022-11-21T12:05:46.6556488Z   <(2, 0, 3, 0, 0, [[false, false, 18148, 
18149]]), (2, 0, 3, 0, 0, [[false, false, 18147, 18148]])>
   ```
   
   The system tables (manifests in the example above) would return the days since epoch instead of a date.
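   The raw values in the failed assertion are day ordinals relative to the Unix epoch; the correspondence can be checked directly:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

# 18147..18149 are the integers that should have rendered as dates
assert EPOCH + timedelta(days=18147) == date(2019, 9, 8)
assert EPOCH + timedelta(days=18148) == date(2019, 9, 9)
assert EPOCH + timedelta(days=18149) == date(2019, 9, 10)
```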





[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028437459


##
python/pyiceberg/table/__init__.py:
##
@@ -16,24 +16,43 @@
 # under the License.
 from __future__ import annotations
 
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from functools import cache

Review Comment:
   Done.






[GitHub] [iceberg] rdblue commented on pull request #6242: API: Restore the type of the identity transform

2022-11-21 Thread GitBox


rdblue commented on PR #6242:
URL: https://github.com/apache/iceberg/pull/6242#issuecomment-1322541809

   Good catch. Thanks, @Fokko!





[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028440440


##
python/pyiceberg/table/__init__.py:
##
@@ -138,7 +157,10 @@ def __eq__(self, other: Any) -> bool:
 )
 
 
-class TableScan:
+S = TypeVar("S", bound="TableScan", covariant=True)

Review Comment:
   I tried this, but the one in typing_extensions didn't work like the one in 
3.11 so I rolled it back to this type variable. Maybe we can figure it out 
later.
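   The pattern the `S` TypeVar enables — fluent methods on a subclass keep returning the subclass type — can be sketched minimally; on Python 3.11+ `typing.Self` expresses the same intent directly. Class names below are hypothetical stand-ins, not pyiceberg's actual classes:

```python
from typing import Generic, TypeVar

S = TypeVar("S", bound="Scan")


class Scan(Generic[S]):
    # update() returns S, so chained calls on a subclass are typed
    # as that subclass rather than the base Scan
    def update(self: S, **overrides: object) -> S:
        new = type(self)()
        new.__dict__.update(self.__dict__)
        new.__dict__.update(overrides)
        return new

    def with_case_sensitive(self: S, case_sensitive: bool = True) -> S:
        return self.update(case_sensitive=case_sensitive)


class DataScan(Scan["DataScan"]):
    pass


scan = DataScan().with_case_sensitive(False)
assert isinstance(scan, DataScan)
assert scan.__dict__["case_sensitive"] is False
```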






[GitHub] [iceberg] Fokko opened a new issue, #6243: Python: BoundType and BoundPredicate should match type

2022-11-21 Thread GitBox


Fokko opened a new issue, #6243:
URL: https://github.com/apache/iceberg/issues/6243

   ### Feature Request / Improvement
   
   When we bind an unbound expression, we'll convert the literal values to the type of the field that it is being bound to. If we instantiate a BoundPredicate, the type should be correct, and we should throw an error otherwise.
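   A generic sketch of the requested behavior, using a hypothetical helper rather than pyiceberg's actual binding code: coerce the literal to the bound field's Python type and raise when the conversion is impossible:

```python
def bind_literal(value, field_type):
    """Hypothetical: coerce a literal to the field's type, or raise."""
    try:
        return field_type(value)
    except (TypeError, ValueError) as exc:
        raise TypeError(f"Cannot bind {value!r} to {field_type.__name__}") from exc


assert bind_literal("22", int) == 22
assert bind_literal(1, float) == 1.0
```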
   
   ### Query engine
   
   _No response_





[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028472043


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
 raise ValueError(f"Cannot scan unknown ref={name}")
 
-def select(self, *field_names: str) -> TableScan:
+def select(self, *field_names: str) -> S:
 if "*" in self.selected_fields:
 return self.update(selected_fields=field_names)
 return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+def filter_rows(self, new_row_filter: BooleanExpression) -> S:
 return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
 return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+def with_case_sensitive(self, case_sensitive: bool = True) -> S:
 return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+file: DataFile
+start: int
+length: int
+
+def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+self.file = data_file
+self.start = start or 0
+self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+pos_to_name: dict[int, str]
+wrapped: dict[str, Any]
+
+def __init__(self, partition_type: StructType):
+self.pos_to_name = {}
+for pos, field in enumerate(partition_type.fields):
+self.pos_to_name[pos] = field.name
+
+def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+self.wrapped = to_wrap
+return self
+
+def get(self, pos: int) -> Any:
+return self.wrapped[self.pos_to_name[pos]]
+
+def set(self, pos: int, value: Any) -> None:
+raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+def __init__(
+self,
+table: Table,
+row_filter: Optional[BooleanExpression] = None,
+partition_filter: Optional[BooleanExpression] = None,
+selected_fields: Tuple[str] = ("*",),
+case_sensitive: bool = True,
+snapshot_id: Optional[int] = None,
+options: Properties = EMPTY_DICT,
+):
+super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+def plan_files(self) -> Iterator[ScanTask]:
+snapshot = self.snapshot()
+if not snapshot:
+return ()
+
+io = self.table.io
+
+# step 1: filter manifests using partition summaries
+# the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id
+
+@cache
+def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:
+spec = self.table.specs()[spec_id]
+return visitors.manifest_evaluator(spec, self.table.schema(), self.partition_filter, self.case_sensitive)
+
+def partition_summary_filter(manifest_file: ManifestFile) -> bool:

Review Comment:
   Updated to the comprehension.
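   In isolation, the `@cache`-keyed-by-spec-id idea quoted above looks like this; `evaluator_for` is a hypothetical stand-in for constructing `visitors.manifest_evaluator` once per spec:

```python
from functools import cache

CALLS = {"count": 0}


@cache
def evaluator_for(spec_id: int):
    # expensive construction happens once per distinct spec_id;
    # repeated lookups return the memoized callable
    CALLS["count"] += 1
    return lambda manifest: f"evaluated spec {spec_id}"


evaluator_for(1)
evaluator_for(1)
evaluator_for(2)
assert CALLS["count"] == 2  # only two distinct spec ids were built
```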






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028473168


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
 raise ValueError(f"Cannot scan unknown ref={name}")
 
-def select(self, *field_names: str) -> TableScan:
+def select(self, *field_names: str) -> S:
 if "*" in self.selected_fields:
 return self.update(selected_fields=field_names)
 return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+def filter_rows(self, new_row_filter: BooleanExpression) -> S:
 return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
 return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+def with_case_sensitive(self, case_sensitive: bool = True) -> S:
 return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+file: DataFile
+start: int
+length: int
+
+def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+self.file = data_file
+self.start = start or 0
+self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+pos_to_name: dict[int, str]
+wrapped: dict[str, Any]
+
+def __init__(self, partition_type: StructType):
+self.pos_to_name = {}
+for pos, field in enumerate(partition_type.fields):
+self.pos_to_name[pos] = field.name
+
+def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+self.wrapped = to_wrap
+return self
+
+def get(self, pos: int) -> Any:
+return self.wrapped[self.pos_to_name[pos]]
+
+def set(self, pos: int, value: Any) -> None:
+raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+def __init__(
+self,
+table: Table,
+row_filter: Optional[BooleanExpression] = None,
+partition_filter: Optional[BooleanExpression] = None,
+selected_fields: Tuple[str] = ("*",),
+case_sensitive: bool = True,
+snapshot_id: Optional[int] = None,
+options: Properties = EMPTY_DICT,
+):
+super().__init__(table, row_filter, partition_filter, selected_fields, 
case_sensitive, snapshot_id, options)
+
+def plan_files(self) -> Iterator[ScanTask]:
+snapshot = self.snapshot()
+if not snapshot:
+return ()

Review Comment:
   Looks like I can just `return` here and it works.
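For context, a bare `return` works there because `plan_files` is a generator: returning early simply ends iteration, so callers still receive an (empty) iterator rather than `None`. A quick illustration of the two behaviors (hypothetical `plan` function, not the pyiceberg code):

```python
from typing import Iterator

def plan(empty: bool) -> Iterator[int]:
    if empty:
        return  # ends the generator: the caller sees an empty iterator, not None
    yield 1
    yield 2

assert list(plan(True)) == []
assert list(plan(False)) == [1, 2]
```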



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] Fokko commented on a diff in pull request #6232: Python: Disallow Any generics

2022-11-21 Thread GitBox


Fokko commented on code in PR #6232:
URL: https://github.com/apache/iceberg/pull/6232#discussion_r1028473547


##
python/pyiceberg/expressions/literals.py:
##
@@ -207,7 +207,7 @@ def __init__(self, value: int):
 super().__init__(value, int)
 
 @singledispatchmethod
-def to(self, type_var: IcebergType) -> Literal:
+def to(self, type_var: IcebergType) -> Literal:  # type: ignore

Review Comment:
   I don't like this either; the problem is that it is unbound. I have some ideas on how to fix this properly in https://github.com/apache/iceberg/issues/6231
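For background, `singledispatchmethod` dispatches on the runtime type of the first argument after `self`, and mypy cannot tie the registered overloads back to the generic base implementation, which is why the `# type: ignore` is needed. A minimal sketch of the dispatch mechanics (simplified names, not the pyiceberg classes):

```python
from functools import singledispatchmethod

class IntLiteral:
    def __init__(self, value: int):
        self.value = value

    @singledispatchmethod
    def to(self, type_var):
        # fallback when no registered overload matches type_var's runtime type
        raise TypeError(f"Cannot convert to {type_var!r}")

    @to.register
    def _(self, type_var: float):
        # chosen when type_var is a float instance
        return float(self.value)

    @to.register
    def _(self, type_var: str):
        # chosen when type_var is a str instance
        return str(self.value)

lit = IntLiteral(7)
assert lit.to(0.0) == 7.0
assert lit.to("") == "7"
```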






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028476399


##
python/pyiceberg/table/__init__.py:
##
@@ -16,24 +16,43 @@
 # under the License.
 from __future__ import annotations
 
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from functools import cache

Review Comment:
   Moving these methods to the instance and using `lru_cache` caused lint failures:
   
   ```
   python/pyiceberg/table/__init__.py:290:5: W1518: 'lru_cache(maxsize=None)' or 'cache' will keep all method args alive indefinitely, including 'self' (method-cache-max-size-none)
   ```
   
   This was fixed when I used `lru_cache` but didn't move the methods to instance methods.
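For reference, W1518 exists because `lru_cache`/`cache` on a method stores `self` in the keys of a class-level cache, so instances are never garbage collected. A small illustration of the warned-about pattern (hypothetical `Scan` class, not pyiceberg code):

```python
import functools
import gc
import weakref

class Scan:
    @functools.lru_cache(maxsize=None)  # pylint W1518: the cache key includes `self`
    def plan(self, spec_id: int) -> str:
        return f"plan-{spec_id}"

scan = Scan()
scan.plan(1)               # caches the (scan, 1) key on the class-level cache
ref = weakref.ref(scan)
del scan
gc.collect()
assert ref() is not None   # the cache still holds a strong reference to the instance
```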






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028477122


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]
+    wrapped: dict[str, Any]
+
+    def __init__(self, partition_type: StructType):
+        self.pos_to_name = {}

Review Comment:
   Updated.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028477438


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]

Review Comment:
   Done.



##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]
+    wrapped: dict[str, Any]

Review Comment:
   Done.



##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0

[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028478168


##
python/pyproject.toml:
##
@@ -84,6 +84,7 @@ build-backend = "poetry.core.masonry.api"
 
 [tool.poetry.extras]
 pyarrow = ["pyarrow"]
+duckdb = ["duckdb"]

Review Comment:
   Added the import. Will update docs, too.
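Declaring the extra under `[tool.poetry.extras]` means the optional dependency is only pulled in when requested at install time; assuming the package is published as `pyiceberg`, users would opt in with:

```shell
pip install "pyiceberg[duckdb]"
```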






[GitHub] [iceberg] Fokko merged pull request #6232: Python: Disallow Any generics

2022-11-21 Thread GitBox


Fokko merged PR #6232:
URL: https://github.com/apache/iceberg/pull/6232






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028489268


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,143 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: dict[int, str]
+    wrapped: dict[str, Any]
+
+    def __init__(self, partition_type: StructType):
+        self.pos_to_name = {}
+        for pos, field in enumerate(partition_type.fields):
+            self.pos_to_name[pos] = field.name
+
+    def wrap(self, to_wrap: dict[str, Any]) -> _DictAsStruct:
+        self.wrapped = to_wrap
+        return self
+
+    def get(self, pos: int) -> Any:
+        return self.wrapped[self.pos_to_name[pos]]
+
+    def set(self, pos: int, value: Any) -> None:
+        raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression] = None,
+        partition_filter: Optional[BooleanExpression] = None,
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+    def plan_files(self) -> Iterator[ScanTask]:
+        snapshot = self.snapshot()
+        if not snapshot:
+            return ()
+
+        io = self.table.io
+
+        # step 1: filter manifests using partition summaries
+        # the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id
+
+        @cache
+        def manifest_filter(spec_id: int) -> Callable[[ManifestFile], bool]:

Review Comment:
   This conflicted with the use of `@lru_cache`, so I ended up removing the cache and making these instance helper methods. To cache the values, I added a [`KeyDefaultDict`](https://stackoverflow.com/questions/2912231/is-there-a-clever-way-to-pass-the-key-to-defaultdicts-default-factory) that initializes values using these helpers.
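The `KeyDefaultDict` referenced here is not shown in the quoted diff; a minimal version of the pattern (the factory receives the missing key, unlike `collections.defaultdict`) might look like:

```python
from typing import Callable, Dict, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class KeyDefaultDict(Dict[K, V]):
    """A dict that builds a missing value from its key via a factory callable."""

    def __init__(self, default_factory: Callable[[K], V]):
        super().__init__()
        self.default_factory = default_factory

    def __missing__(self, key: K) -> V:
        value = self.default_factory(key)
        self[key] = value  # cache the built value for subsequent lookups
        return value

# hypothetical usage, mirroring the per-spec evaluator caches described above
evaluators = KeyDefaultDict(lambda spec_id: f"evaluator-for-spec-{spec_id}")
assert evaluators[7] == "evaluator-for-spec-7"
assert 7 in evaluators  # built once, then cached
```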






[GitHub] [iceberg] rdblue commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028501655


##
python/pyiceberg/table/__init__.py:
##
@@ -16,24 +16,43 @@
 # under the License.
 from __future__ import annotations
 
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from functools import cache

Review Comment:
   I removed this in favor of a dict that creates default values based on key.






[GitHub] [iceberg] ahshahid commented on issue #6198: colStats flag in TableContext remains false except in situation where delete files are present

2022-11-21 Thread GitBox


ahshahid commented on issue #6198:
URL: https://github.com/apache/iceberg/issues/6198#issuecomment-1322664170

   I am trying to reproduce the behaviour which I saw.





[GitHub] [iceberg] asheeshgarg commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


asheeshgarg commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1322683140

   After a couple of JAR iterations, I am able to read the catalog. Now, when reading the data, I am getting:
   Exception in thread "main" java.lang.ExceptionInInitializerError
   at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1161)
   at org.apache.arrow.vector.BaseVariableWidthVector.initOffsetBuffer(BaseVariableWidthVector.java:242)
   at org.apache.arrow.vector.BaseVariableWidthVector.allocateBytes(BaseVariableWidthVector.java:467)
   at org.apache.arrow.vector.BaseVariableWidthVector.allocateNew(BaseVariableWidthVector.java:419)
   at org.apache.arrow.vector.BaseVariableWidthVector.allocateNewSafe(BaseVariableWidthVector.java:393)
   at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnOriginalType(VectorizedArrowReader.java:247)
   at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:218)
   at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:132)
   at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:46)
   at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:29)
   at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:149)
   at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:313)
   at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:189)
   at com.jpmorgan.Test.readContent(Test.java:62)
   at com.jpmorgan.Test.main(Test.java:72)
   Caused by: java.lang.RuntimeException: Failed to initialize MemoryUtil.
   at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:136)
   ... 15 more
   Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field long java.nio.Buffer.address accessible: module java.base does not "opens java.nio" to unnamed module @3fb875d8
   at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
   at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
   at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
   at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
   at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:84)
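This `InaccessibleObjectException` is the usual Arrow-on-modern-JDK strong-encapsulation failure: Arrow's `MemoryUtil` reflects on `java.nio.Buffer.address`, which the module system blocks by default on JDK 16+. Per Arrow's Java documentation, the workaround is to open `java.nio` to unnamed modules when launching the JVM; a sketch (the classpath is a placeholder, the main class is taken from the stack trace above):

```shell
java --add-opens=java.base/java.nio=ALL-UNNAMED -cp app.jar com.jpmorgan.Test
```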





[GitHub] [iceberg] JonasJ-ap commented on pull request #5331: WIP: Adding support for Delta to Iceberg migration

2022-11-21 Thread GitBox


JonasJ-ap commented on PR #5331:
URL: https://github.com/apache/iceberg/pull/5331#issuecomment-1322684304

   Hi @ericlgoodman. My name is Rushan Jiang, a CS undergrad at CMU. I am 
interested in learning and contributing to this migration support. I saw you 
did not update this PR for some time. Would you mind allowing me to continue 
your work? 
   
   I appreciate your time and consideration.





[GitHub] [iceberg] Fokko commented on a diff in pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


Fokko commented on code in PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#discussion_r1028552098


##
python/pyiceberg/table/__init__.py:
##
@@ -199,16 +223,144 @@ def use_ref(self, name: str):
 
         raise ValueError(f"Cannot scan unknown ref={name}")
 
-    def select(self, *field_names: str) -> TableScan:
+    def select(self, *field_names: str) -> S:
         if "*" in self.selected_fields:
             return self.update(selected_fields=field_names)
         return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))
 
-    def filter_rows(self, new_row_filter: BooleanExpression) -> TableScan:
+    def filter_rows(self, new_row_filter: BooleanExpression) -> S:
         return self.update(row_filter=And(self.row_filter, new_row_filter))
 
-    def filter_partitions(self, new_partition_filter: BooleanExpression) -> TableScan:
+    def filter_partitions(self, new_partition_filter: BooleanExpression) -> S:
         return self.update(partition_filter=And(self.partition_filter, new_partition_filter))
 
-    def with_case_sensitive(self, case_sensitive: bool = True) -> TableScan:
+    def with_case_sensitive(self, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
+
+
+class ScanTask(ABC):
+    pass
+
+
+@dataclass(init=False)
+class FileScanTask(ScanTask):
+    file: DataFile
+    start: int
+    length: int
+
+    def __init__(self, data_file: DataFile, start: Optional[int] = None, length: Optional[int] = None):
+        self.file = data_file
+        self.start = start or 0
+        self.length = length or data_file.file_size_in_bytes
+
+
+class _DictAsStruct(StructProtocol):
+    pos_to_name: Dict[int, str]
+    wrapped: Dict[str, Any]
+
+    def __init__(self, partition_type: StructType):
+        self.pos_to_name = {pos: field.name for pos, field in enumerate(partition_type.fields)}
+
+    def wrap(self, to_wrap: Dict[str, Any]) -> _DictAsStruct:
+        self.wrapped = to_wrap
+        return self
+
+    def get(self, pos: int) -> Any:
+        return self.wrapped[self.pos_to_name[pos]]
+
+    def set(self, pos: int, value: Any) -> None:
+        raise NotImplementedError("Cannot set values in DictAsStruct")
+
+
+class DataScan(TableScan["DataScan"]):
+    def __init__(
+        self,
+        table: Table,
+        row_filter: Optional[BooleanExpression] = None,
+        partition_filter: Optional[BooleanExpression] = None,
+        selected_fields: Tuple[str] = ("*",),
+        case_sensitive: bool = True,
+        snapshot_id: Optional[int] = None,
+        options: Properties = EMPTY_DICT,
+    ):
+        super().__init__(table, row_filter, partition_filter, selected_fields, case_sensitive, snapshot_id, options)
+
+    def _build_manifest_evaluator(self, spec_id: int) -> Callable[[ManifestFile], bool]:
+        spec = self.table.specs()[spec_id]
+        return visitors.manifest_evaluator(spec, self.table.schema(), self.partition_filter, self.case_sensitive)
+
+    def _build_partition_evaluator(self, spec_id: int) -> Callable[[DataFile], bool]:
+        spec = self.table.specs()[spec_id]
+        partition_type = spec.partition_type(self.table.schema())
+        partition_schema = Schema(*partition_type.fields)
+
+        # TODO: project the row filter  # pylint: disable=W0511
+        partition_expr = And(self.partition_filter, AlwaysTrue())
+
+        # TODO: remove the dict to struct wrapper by using a StructProtocol record  # pylint: disable=W0511
+        wrapper = _DictAsStruct(partition_type)
+        evaluator = visitors.expression_evaluator(partition_schema, partition_expr, self.case_sensitive)
+
+        return lambda data_file: evaluator(wrapper.wrap(data_file.partition))
+
+    def plan_files(self) -> Iterator[ScanTask]:
+        snapshot = self.snapshot()
+        if not snapshot:
+            return
+
+        io = self.table.io
+
+        # step 1: filter manifests using partition summaries
+        # the filter depends on the partition spec used to write the manifest file, so create a cache of filters for each spec id
+
+        manifest_evaluators: Dict[int, Callable[[ManifestFile], bool]] = KeyDefaultDict(self._build_manifest_evaluator)
+
+        manifests = [
+            manifest_file
+            for manifest_file in snapshot.manifests(io)
+            if manifest_evaluators[manifest_file.partition_spec_id](manifest_file)
+        ]
+
+        # step 2: filter the data files in each manifest
+        # this filter depends on the partition spec used to write the manifest file
+
+        partition_evaluators: Dict[int, Callable[[DataFile], bool]] = KeyDefaultDict(self._build_partition_evaluator)

Review Comment:
   I like this, very elegant!




[GitHub] [iceberg] rdblue merged pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue merged PR #6233:
URL: https://github.com/apache/iceberg/pull/6233





[GitHub] [iceberg] rdblue commented on pull request #6233: Python: Implement DataScan.plan_files

2022-11-21 Thread GitBox


rdblue commented on PR #6233:
URL: https://github.com/apache/iceberg/pull/6233#issuecomment-1322733138

   Thanks, @Fokko! I'll follow up with testing.





[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection

2022-11-21 Thread GitBox


rdblue commented on code in PR #6128:
URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028599283


##
python/pyiceberg/expressions/__init__.py:
##
@@ -601,36 +640,65 @@ def __eq__(self, other):
     def __repr__(self) -> str:
         return f"{str(self.__class__.__name__)}(term={repr(self.term)}, literal={repr(self.literal)})"
 
+    @property
+    @abstractmethod
+    def as_unbound(self) -> Type[LiteralPredicate[L]]:
+        ...
+
 
 class BoundEqualTo(BoundLiteralPredicate[L]):
     def __invert__(self) -> BoundNotEqualTo[L]:
         return BoundNotEqualTo[L](self.term, self.literal)
 
+    @property
+    def as_unbound(self) -> Type[EqualTo[L]]:
+        return EqualTo
+
 
 class BoundNotEqualTo(BoundLiteralPredicate[L]):
     def __invert__(self) -> BoundEqualTo[L]:
         return BoundEqualTo[L](self.term, self.literal)
 
+    @property
+    def as_unbound(self) -> Type[NotEqualTo[L]]:
+        return NotEqualTo

Review Comment:
   Does this need the type param, `NotEqualTo[L]`? It looks like `GreaterThan` 
has it.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection

2022-11-21 Thread GitBox


rdblue commented on code in PR #6128:
URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028602287


##
python/pyiceberg/transforms.py:
##
@@ -644,8 +748,88 @@ def can_transform(self, _: IcebergType) -> bool:
 def result_type(self, source: IcebergType) -> IcebergType:
 return source
 
+def project(self, name: str, pred: BoundPredicate[L]) -> 
Optional[UnboundPredicate[Any]]:
+return None
+
 def to_human_string(self, _: IcebergType, value: Optional[S]) -> str:
 return "null"
 
 def __repr__(self) -> str:
 return "VoidTransform()"
+
+
+def _truncate_number(
+name: str, pred: BoundLiteralPredicate[L], transform: 
Callable[[Optional[L]], Optional[L]]
+) -> Optional[UnboundPredicate[Any]]:
+boundary = pred.literal
+
+if not isinstance(boundary, (LongLiteral, DecimalLiteral, DateLiteral, 
TimestampLiteral)):
+raise ValueError(f"Expected a numeric literal, got: {type(boundary)}")
+
+if isinstance(pred, BoundLessThan):
+return LessThanOrEqual(Reference(name), _transform_literal(transform, 
boundary.decrement()))  # type: ignore
+elif isinstance(pred, BoundLessThanOrEqual):
+return LessThanOrEqual(Reference(name), _transform_literal(transform, 
boundary))
+elif isinstance(pred, BoundGreaterThan):
+return GreaterThanOrEqual(Reference(name), 
_transform_literal(transform, boundary.increment()))  # type: ignore
+elif isinstance(pred, BoundGreaterThanOrEqual):
+return GreaterThanOrEqual(Reference(name), 
_transform_literal(transform, boundary))
+elif isinstance(pred, BoundEqualTo):
+return EqualTo(Reference(name), _transform_literal(transform, 
boundary))
+else:
+return None
+
+
+def _truncate_array(
+name: str, pred: BoundLiteralPredicate[L], transform: 
Callable[[Optional[L]], Optional[L]]
+) -> Optional[UnboundPredicate[Any]]:
+boundary = pred.literal
+
+if type(pred) in {BoundLessThan, BoundLessThanOrEqual}:
+return LessThanOrEqual(Reference(name), _transform_literal(transform, 
boundary))
+elif type(pred) in {BoundGreaterThan, BoundGreaterThanOrEqual}:
+return GreaterThanOrEqual(Reference(name), 
_transform_literal(transform, boundary))
+if isinstance(pred, BoundEqualTo):
+return EqualTo(Reference(name), _transform_literal(transform, 
boundary))
+else:
+return None
+
+
+def _project_transform_predicate(
+transform: Transform[Any, Any], partition_name: str, pred: 
BoundPredicate[L]
+) -> Optional[UnboundPredicate[Any]]:
+term = pred.term
+if isinstance(term, BoundTransform) and transform == term.transform:
+return _remove_transform(partition_name, pred)
+return None
+
+
+def _remove_transform(partition_name: str, pred: BoundPredicate[L]):
+if isinstance(pred, BoundUnaryPredicate):
+return pred.as_unbound(Reference(partition_name))
+elif isinstance(pred, BoundLiteralPredicate):
+return pred.as_unbound(Reference(partition_name), pred.literal)
+elif isinstance(pred, (BoundIn, BoundNotIn)):
+return pred.as_unbound(Reference(partition_name), pred.literals)
+else:
+raise ValueError(f"Cannot replace transform in unknown predicate: 
{pred}")
+
+
+def _set_apply_transform(
+name: str, pred: BoundSetPredicate[L], transform: Callable[[L], L]
+) -> UnboundPredicate[Any]:
+literals = pred.literals
+if isinstance(pred, BoundSetPredicate):
+return pred.as_unbound(Reference(name), {_transform_literal(transform, 
literal) for literal in literals})
+else:
+raise ValueError(f"Unknown BoundSetPredicate: {pred}")

Review Comment:
   Nit: in this case, it isn't a `BoundSetPredicate` because `isinstance` 
returned false.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection

2022-11-21 Thread GitBox


rdblue commented on code in PR #6128:
URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028604963


##
python/pyiceberg/transforms.py:
##
@@ -644,8 +751,89 @@ def can_transform(self, _: IcebergType) -> bool:
 def result_type(self, source: IcebergType) -> IcebergType:
 return source
 
+def project(self, name: str, pred: BoundPredicate) -> 
Optional[UnboundPredicate]:
+return None
+
 def to_human_string(self, _: IcebergType, value: Optional[S]) -> str:
 return "null"
 
 def __repr__(self) -> str:
 return "VoidTransform()"
+
+
+M = TypeVar("M")
+
+
+def _truncate_number(
+name: str, pred: BoundLiteralPredicate, transform: Callable[[Optional[M]], 
Optional[M]]
+) -> Optional[UnboundPredicate]:
+boundary = pred.literal
+
+if not isinstance(boundary, (LongLiteral, DecimalLiteral, DateLiteral, 
TimeLiteral, TimestampLiteral)):
+raise ValueError(f"Expected a numeric literal, got: {type(boundary)}")
+
+if isinstance(pred, BoundLessThan):
+return LessThanOrEqual(Reference(name), _transform_literal(transform, 
boundary.decrement()))
+elif isinstance(pred, BoundLessThanOrEqual):
+return LessThanOrEqual(Reference(name), _transform_literal(transform, 
boundary))
+elif isinstance(pred, BoundGreaterThan):
+return GreaterThanOrEqual(Reference(name), 
_transform_literal(transform, boundary.increment()))
+elif isinstance(pred, BoundGreaterThanOrEqual):
+return GreaterThanOrEqual(Reference(name), 
_transform_literal(transform, boundary))
+elif isinstance(pred, BoundEqualTo):
+return EqualTo(Reference(name), _transform_literal(transform, 
boundary))
+else:
+return None
+
+
+def _truncate_array(
+name: str, pred: BoundLiteralPredicate, func: Callable[[Optional[M]], 
Optional[M]]
+) -> Optional[UnboundPredicate]:
+boundary = pred.literal
+
+if type(pred) in {BoundLessThan, BoundLessThanOrEqual}:
+return LessThanOrEqual(Reference(name), _transform_literal(func, 
boundary))
+elif type(pred) in {BoundGreaterThan, BoundGreaterThanOrEqual}:
+return GreaterThanOrEqual(Reference(name), _transform_literal(func, 
boundary))
+if isinstance(pred, BoundEqualTo):
+return EqualTo(Reference(name), _transform_literal(func, boundary))
+else:
+return None
+
+
+def _project_transform_predicate(transform: Transform, partition_name: str, 
pred: BoundPredicate) -> Optional[UnboundPredicate]:
+term = pred.term
+if isinstance(term, BoundTransform) and transform == term.transform:
+return _remove_transform(partition_name, pred)
+return None
+
+
+def _remove_transform(partition_name: str, pred: BoundPredicate):
+if isinstance(pred, BoundUnaryPredicate):
+return pred.as_unbound(Reference(partition_name))
+elif isinstance(pred, BoundLiteralPredicate):
+return pred.as_unbound(Reference(partition_name), pred.literal)
+elif isinstance(pred, (BoundIn, BoundNotIn)):
+return pred.as_unbound(Reference(partition_name), pred.literals)
+else:
+raise ValueError(f"Cannot replace transform in unknown predicate: 
{pred}")
+
+
+def _transform_set(name: str, pred: BoundSetPredicate, func: 
Callable[[Optional[M]], Optional[M]]) -> UnboundPredicate:
+literals = pred.literals
+if isinstance(pred, BoundIn):
+return In(Reference(name), {_transform_literal(func, literal) for 
literal in literals})
+elif isinstance(pred, BoundIn):
+return NotIn(Reference(name), {_transform_literal(func, literal) for 
literal in literals})
+else:
+raise ValueError(f"Unknown BoundSetPredicate: {pred}")
+
+
+class BoundTransform(BoundTerm[T]):

Review Comment:
   Strange. I don't see an import for expressions. In any case, we can fix it 
later.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6128: Python: Projection

2022-11-21 Thread GitBox


rdblue commented on code in PR #6128:
URL: https://github.com/apache/iceberg/pull/6128#discussion_r1028605806


##
python/pyiceberg/transforms.py:
##
@@ -511,6 +590,31 @@ def preserves_order(self) -> bool:
 def source_type(self) -> IcebergType:
 return self._source_type
 
+def project(self, name: str, pred: BoundPredicate) -> 
Optional[UnboundPredicate]:
+field_type = pred.term.ref().field.field_type
+
+if isinstance(pred.term, BoundTransform):
+return _project_transform_predicate(self, name, pred)
+
+# Implement startswith and notstartswith for string (and probably 
binary)
+# https://github.com/apache/iceberg/issues/6112
+
+if isinstance(pred, BoundUnaryPredicate):
+return pred.as_unbound(Reference(name))
+
+if isinstance(field_type, (IntegerType, LongType, DecimalType)):
+if isinstance(pred, BoundLiteralPredicate):
+return _truncate_number(name, pred, self.transform(field_type))
+elif isinstance(pred, BoundIn):
+return _transform_set(name, pred, self.transform(field_type))

Review Comment:
   Yeah, I just meant for the inner `elif` case, where both call 
`_set_apply_transform`.






[GitHub] [iceberg] rdblue commented on pull request #6128: Python: Projection

2022-11-21 Thread GitBox


rdblue commented on PR #6128:
URL: https://github.com/apache/iceberg/pull/6128#issuecomment-1322801754

   I fixed a formatting issue and tests are passing! Overall this looks great 
with only a couple nits left, so I'll merge it. Thanks for getting this in, 
@Fokko!





[GitHub] [iceberg] rdblue merged pull request #6128: Python: Projection

2022-11-21 Thread GitBox


rdblue merged PR #6128:
URL: https://github.com/apache/iceberg/pull/6128





[GitHub] [iceberg] rdblue commented on pull request #6069: Python: TableScan Plan files API implementation without residual evaluation

2022-11-21 Thread GitBox


rdblue commented on PR #6069:
URL: https://github.com/apache/iceberg/pull/6069#issuecomment-1322809905

   @dhruv-pratap, we've been working on the list lately and I think the 
remaining items are: (4) add `MetricsEvalVisitor` to prune by column stats, (5) 
add `ResidualEvalVisitor` to produce residuals (LOW priority), and (8) expose 
`plan_files` via the CLI.
   
   For the last one, we may just want to add filters to the existing `files` 
command?





[GitHub] [iceberg] dhruv-pratap commented on pull request #6069: Python: TableScan Plan files API implementation without residual evaluation

2022-11-21 Thread GitBox


dhruv-pratap commented on PR #6069:
URL: https://github.com/apache/iceberg/pull/6069#issuecomment-1322841113

   > @dhruv-pratap, we've been working on the list lately and I think the 
remaining items are: (4) add `MetricsEvalVisitor` to prune by column stats, (5) 
add `ResidualEvalVisitor` to produce residuals (LOW priority), and (8) expose 
`plan_files` via the CLI.
   > 
   > For the last one, we may just want to add filters to the existing `files` 
command?
   
   @rdblue We discussed this more in 
[this](https://github.com/apache/iceberg/pull/6069#discussion_r1012044761) 
thread. We can extend `files` with three options:
   
   * `files --snapshot=all` - The current files behavior
   * `files --snapshot=current` - Files under the current snapshot
   * `files --snapshot=` - Files under the snapshot_id specified
   
   Adding `expression` option would be a bigger undertaking though, as it would 
require an expression parser. Not sure if we want to prioritize that at this 
point for the CLI. 
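   The proposed `--snapshot` option could be sketched as below. This is a hypothetical illustration only: the option name, accepted values, and parsing logic are assumptions drawn from the list above, not the actual pyiceberg CLI implementation.

   ```python
   import argparse

   # Hypothetical sketch of the proposed `files --snapshot` option.
   parser = argparse.ArgumentParser(prog="files")
   parser.add_argument(
       "--snapshot",
       default="all",
       help="'all' (current behavior), 'current', or an explicit snapshot id",
   )

   def resolve(argv):
       """Map the raw option value to a (mode, snapshot_id) pair."""
       args = parser.parse_args(argv)
       if args.snapshot in ("all", "current"):
           return (args.snapshot, None)
       # Anything else is treated as a numeric snapshot id.
       return ("by-id", int(args.snapshot))

   print(resolve([]))                           # ('all', None)
   print(resolve(["--snapshot", "current"]))    # ('current', None)
   print(resolve(["--snapshot", "123456789"]))  # ('by-id', 123456789)
   ```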
   





[GitHub] [iceberg] emkornfield commented on issue #644: Views on top of Iceberg tables

2022-11-21 Thread GitBox


emkornfield commented on issue #644:
URL: https://github.com/apache/iceberg/issues/644#issuecomment-1322849306

   What is the current state of views in general? It seems like there has been 
development effort here, but I didn't see a vote on the mailing list officially 
adopting the specification (could just be my poor search abilities). Is the 
specification formally ratified? If not, what are the milestones before a vote 
happens?





[GitHub] [iceberg] luoyuxia commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)

2022-11-21 Thread GitBox


luoyuxia commented on issue #6104:
URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1322899747

   > This means that for CDC data streaming to Iceberg, there is no viable 
data compaction scheme at this stage for streams that contain delete 
operations?
   
   Yes, I'm afraid so.
   
   > Here's a question: is it possible to pause the writer for one round of 
data compaction, resume writing from the checkpoint once the compaction is 
completed, and handle the commit exception above by cyclically suspending, 
compacting, and writing again?
   
   I think it's possible. From the code, IIUC, the exception happens when you 
start a compaction but it finds delete files before the compaction finishes. 
As long as no delete files are produced between the start and finish of a 
compaction, no exception should be thrown.
   But I'm not sure; you can give it a try and see whether it works.
   
   
   
   





[GitHub] [iceberg] hameizi commented on pull request #6237: Core: Fix check is delete file and data file overlap

2022-11-21 Thread GitBox


hameizi commented on PR #6237:
URL: https://github.com/apache/iceberg/pull/6237#issuecomment-1322908600

   > I think the current judgment has already dealt with this situation, IIUC, 
deletes and data wil **not overlap** if: `(dataLower > deleteUpper) || 
(deleteLower > dataUpper)` So they **will overlap** when `!((dataLower > 
deleteUpper) || (deleteLower > dataUpper)) ` -> `(dataLower <= deleteUpper) && 
(deleteLower <= dataUpper)` which is exactly the current judgment.
   
   You are right, I will close this PR.
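   The judgment quoted above can be sketched as a closed-interval overlap check. This is a minimal illustration of the logic under discussion, not the actual Iceberg code:

   ```python
   def ranges_overlap(data_lower, data_upper, delete_lower, delete_upper):
       # Two closed ranges do NOT overlap iff one lies entirely past the other:
       #   (data_lower > delete_upper) or (delete_lower > data_upper)
       # Negating that gives the overlap condition from the judgment above:
       #   (data_lower <= delete_upper) and (delete_lower <= data_upper)
       return data_lower <= delete_upper and delete_lower <= data_upper

   print(ranges_overlap(1, 5, 6, 9))   # False: deletes start after data ends
   print(ranges_overlap(1, 5, 5, 9))   # True: a shared boundary counts as overlap
   print(ranges_overlap(3, 7, 1, 4))   # True: partial overlap
   ```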





[GitHub] [iceberg] hameizi closed pull request #6237: Core: Fix check is delete file and data file overlap

2022-11-21 Thread GitBox


hameizi closed pull request #6237: Core: Fix check is delete file and data file 
overlap
URL: https://github.com/apache/iceberg/pull/6237





[GitHub] [iceberg] SHuixo commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)

2022-11-21 Thread GitBox


SHuixo commented on issue #6104:
URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1322917998

   Yes, we tried it and found that it worked. 
   The following two logs are the results of the compaction test.

   > 
[compact-data-when-stream-write.log](https://github.com/apache/iceberg/files/10057144/compact-data-when-stream-write.log)
   > 
[compact-data-when-write-finished.log](https://github.com/apache/iceberg/files/10057146/compact-data-when-write-finish.log)
   
   





[GitHub] [iceberg] SHuixo closed issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)

2022-11-21 Thread GitBox


SHuixo closed issue #6104: Rewrite iceberg small files with flink succeeds but 
no snapshot is generated (V2 - upsert model)
URL: https://github.com/apache/iceberg/issues/6104





[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6239: Docs: Select the right Spark catalog

2022-11-21 Thread GitBox


ajantha-bhat commented on code in PR #6239:
URL: https://github.com/apache/iceberg/pull/6239#discussion_r1028370385


##
docs/aws.md:
##
@@ -68,6 +68,7 @@ done
 
 # start Spark SQL client shell
 spark-sql --packages $DEPENDENCIES \
+--conf spark.sql.defaultCatalog=my_catalog \

Review Comment:
   There are multiple `spark-sql` commands in the same file; do we need to add 
this to all of them, and also check the other files? 






[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6217: Core: Fix deletion of old metadata files when METADATA_DELETE_AFTER_COMMIT_ENABLED is set

2022-11-21 Thread GitBox


amogh-jahagirdar commented on code in PR #6217:
URL: https://github.com/apache/iceberg/pull/6217#discussion_r1028798630


##
core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java:
##
@@ -414,16 +414,15 @@ private void deleteRemovedMetadataFiles(TableMetadata 
base, TableMetadata metada
 TableProperties.METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT);
 
 if (deleteAfterCommit) {
-  Set removedPreviousMetadataFiles =
-  Sets.newHashSet(base.previousFiles());
-  removedPreviousMetadataFiles.removeAll(metadata.previousFiles());
-  Tasks.foreach(removedPreviousMetadataFiles)
+  Set filesToDelete = 
Sets.newHashSet(base.previousFiles());
+  filesToDelete.addAll(metadata.previousFiles());
+  Tasks.foreach(filesToDelete)
   .noRetry()
   .suppressFailureWhenFinished()
   .onFailure(
-  (previousMetadataFile, exc) ->
+  (metadataFileToDelete, exc) ->
   LOG.warn(
-  "Delete failed for previous metadata file: {}", 
previousMetadataFile, exc))
+  "Delete failed for previous metadata file: {}", 
metadataFileToDelete, exc))
   .run(previousMetadataFile -> 
io().deleteFile(previousMetadataFile.file()));

Review Comment:
   Should this use batch delete if applicable?  I don't think it's a blocker 
though, something we can get in a follow on if it's too intrusive.






[GitHub] [iceberg] arunb2w opened a new issue, #6244: Iceberg metadata not stored properly

2022-11-21 Thread GitBox


arunb2w opened a new issue, #6244:
URL: https://github.com/apache/iceberg/issues/6244

   I tried creating a sample Iceberg table with the schema below:
   
   ```
CREATE TABLE glue_dev.db.datatype_test (
   id bigint,
   data string,
   category string
   )
   USING iceberg
   TBLPROPERTIES ('read.split.target-size'='134217728', 
"write.metadata.metrics.default"="full")
   ```
   
   Then I inserted around 100 records and ran rewrite files afterwards, so that 
all the inserted data would be rewritten into a single file.
   
   ```
   # Assuming a SparkSession named `spark` is available
   for num in range(1, 100):
       spark.sql(f"INSERT INTO glue_dev.db.datatype_test VALUES ({num}, 'data{num}', 'catagory{num}')")
   ```
   
   After that I tried to query the data_files metadata.
   
   ```
select * from glue_dev.db.datatype_test.data_files limit 10;
   content  file_path   file_format spec_id record_count
file_size_in_bytes  column_sizesvalue_countsnull_value_counts   
nan_value_countslower_boundsupper_boundskey_metadata
split_offsets   equality_idssort_order_id
   0
s3://bucket/folder/db.db/datatype_test/data/0-0-4bb9ce80-c7a7-4192-98c6-ed6e7289a981-1.parquet
  PARQUET 0   559 4234{1:991,2:1188,3:1258}   {1:559,2:559,3:559}   
  {1:0,2:0,3:0}   {}  {1:,2:data1,3:catagory1}
{1:/,2:data99,3:catagory99} NULL[4] NULL0
   Time taken: 15.614 seconds, Fetched 1 row(s)
   ```
   
   In this metadata, the lower_bounds and upper_bounds values for the id column 
(which is of type bigint) don't appear to be correct.
   
   Does that mean Iceberg is not storing the metadata correctly?
   In that case, if I join on the id column, how will Iceberg properly scan/skip 
files when the lower and upper bound metadata is not correct?
   
   This is test data, but I have actual data with around 1 files, and they 
too exhibit similar behaviour for integer columns. 
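   One likely explanation (a sketch only, assuming the Iceberg spec's little-endian single-value serialization for longs): the bounds are stored as raw binary, so a small bigint serializes to one printable byte followed by NUL bytes. That looks wrong when the metadata table renders the bytes as text, yet decodes back to the correct value:

   ```python
   import struct

   # Assumption: a long bound is stored as 8 bytes, little-endian,
   # per the Iceberg spec's single-value serialization.
   lower = struct.pack("<q", 1)
   upper = struct.pack("<q", 99)

   print(lower)  # b'\x01\x00\x00\x00\x00\x00\x00\x00' -> unprintable as text
   print(upper)  # b'c\x00\x00\x00\x00\x00\x00\x00'    -> renders as just "c"

   # Decoding the bytes recovers the true bounds, so scan planning and
   # file skipping still compare the real values:
   assert struct.unpack("<q", lower)[0] == 1
   assert struct.unpack("<q", upper)[0] == 99
   ```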





[GitHub] [iceberg] lirui-apache commented on pull request #5206: Core: Defer reading Avro metadata until ManifestFile is read

2022-11-21 Thread GitBox


lirui-apache commented on PR #5206:
URL: https://github.com/apache/iceberg/pull/5206#issuecomment-1323119574

   We encountered authorization failures reading manifest files after applying 
this PR, and thought it might be related. Since the worker pool in use is by 
default a global static pool, the threads in this pool may not see the correct 
UGI of the user submitting the operation, and hit permission-denied errors when 
trying to open the stream. I wonder whether we should call `doAs` in the pool, 
or let different users use separate pools?





[GitHub] [iceberg] pvary commented on issue #6241: Updating the HiveCatalog Conf with setConf does not also reset FileIO

2022-11-21 Thread GitBox


pvary commented on issue #6241:
URL: https://github.com/apache/iceberg/issues/6241#issuecomment-1323146232

   With the current `Catalog` implementations it is probably better to recreate 
the catalogs anyway, as they are mostly a thin wrapper above the HMSClientPool.





[GitHub] [iceberg] nastra commented on issue #6003: Vectorized Read

2022-11-21 Thread GitBox


nastra commented on issue #6003:
URL: https://github.com/apache/iceberg/issues/6003#issuecomment-1323178266

   This problem is specific to Arrow itself when running with JDK9+, because 
Arrow's `MemoryUtil` requires access to the mentioned Java module. See 
https://arrow.apache.org/docs/java/install.html#java-compatibility / 
https://github.com/apache/arrow/commit/0310add1776054a67aacfb1a7bc0f569a19c7628#diff-23f1adfa46ef0fe69d6902b25cd417114b089d692d0c313d3a9a2c8ceff7cd3d
 on how to resolve this





[GitHub] [iceberg] psnilesh opened a new issue, #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases

2022-11-21 Thread GitBox


psnilesh opened a new issue, #6245:
URL: https://github.com/apache/iceberg/issues/6245

   ### Apache Iceberg version
   
   1.0.0 (latest release)
   
   ### Query engine
   
   _No response_
   
   ### Please describe the bug ๐Ÿž
   
   The issue is the same as https://github.com/apache/iceberg/issues/2796. I see 
this was already fixed by @jfz and merged into `master`, but I don't see the fix 
in either the `0.14.1` or `1.0.0` branch. Is it a miss, or was it deliberately 
not included for some reason?





[GitHub] [iceberg] nastra commented on pull request #6169: AWS,Core: Add S3 REST Signer client + REST Spec

2022-11-21 Thread GitBox


nastra commented on PR #6169:
URL: https://github.com/apache/iceberg/pull/6169#issuecomment-1323218731

   I have moved the S3 Signer REST Spec to the `iceberg-aws` module.





[GitHub] [iceberg] ajantha-bhat commented on issue #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases

2022-11-21 Thread GitBox


ajantha-bhat commented on issue #6245:
URL: https://github.com/apache/iceberg/issues/6245#issuecomment-1323226313

   The 0.14.1 and 1.0.0 releases are not from the master branch; they were based 
on the 0.14.0 branch.
   The fix you are expecting will be available in the upcoming 1.1.0 release.





[GitHub] [iceberg] nastra commented on pull request #6240: Nessie: Refactor NessieTableOperations#doCommit

2022-11-21 Thread GitBox


nastra commented on PR #6240:
URL: https://github.com/apache/iceberg/pull/6240#issuecomment-1323227993

   cc: @dimas-b / @snazy, could you please review this one?
   





[GitHub] [iceberg] psnilesh commented on issue #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases

2022-11-21 Thread GitBox


psnilesh commented on issue #6245:
URL: https://github.com/apache/iceberg/issues/6245#issuecomment-1323228826

   Thank you. When can I expect 1.1.0?





[GitHub] [iceberg] ajantha-bhat commented on issue #6244: Iceberg metadata not stored properly

2022-11-21 Thread GitBox


ajantha-bhat commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1323230663

   > Does that mean iceberg is not storing the metadata correctly?
   
   Lower and upper bounds are stored as byte arrays in the manifest files, so what 
you are seeing is the byte-array string representation, which is not human-readable. 
During file pruning, these values are converted back to their proper data types 
before being compared. 
   
   We could enhance the metadata tables to show the actual values. I believe 
@szehon-ho has an open PR for this.
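
   As a rough illustration of why the raw bounds look unreadable, here is a minimal 
sketch using plain `java.nio` with a made-up byte array (Iceberg itself provides 
`Conversions.fromByteBuffer` in `org.apache.iceberg.types` for the real decoding), 
assuming the spec's little-endian single-value serialization for an `int` column:

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   public class BoundDecodeSketch {
       public static void main(String[] args) {
           // Hypothetical lower-bound bytes for an int column, as they might
           // appear when the manifest metadata is printed raw.
           byte[] rawBound = {57, 48, 0, 0};

           // Iceberg serializes single values (such as column bounds) as
           // little-endian bytes, so decode accordingly.
           int decoded = ByteBuffer.wrap(rawBound)
                   .order(ByteOrder.LITTLE_ENDIAN)
                   .getInt();

           System.out.println(decoded); // prints 12345
       }
   }
   ```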





[GitHub] [iceberg] ajantha-bhat commented on issue #6244: Iceberg metadata not stored properly

2022-11-21 Thread GitBox


ajantha-bhat commented on issue #6244:
URL: https://github.com/apache/iceberg/issues/6244#issuecomment-1323232753

   As a workaround, I believe 
https://github.com/hililiwei/iceberg-tools#manifest2json can convert the manifests 
and display these values. 
   
   cc: @hililiwei 





[GitHub] [iceberg] ajantha-bhat commented on issue #6245: Fix for issue #2796 is missing from 0.14.1 and 1.0.x releases

2022-11-21 Thread GitBox


ajantha-bhat commented on issue #6245:
URL: https://github.com/apache/iceberg/issues/6245#issuecomment-1323233889

   The release vote is already open, so the release should be available within 3 to 
4 days.





[GitHub] [iceberg] Fokko merged pull request #6211: Allow dropping a column used by old SortOrders but not current SortOrder

2022-11-21 Thread GitBox


Fokko merged PR #6211:
URL: https://github.com/apache/iceberg/pull/6211





[GitHub] [iceberg] Fokko commented on issue #6204: Allow dropping a column used by old SortOrders

2022-11-21 Thread GitBox


Fokko commented on issue #6204:
URL: https://github.com/apache/iceberg/issues/6204#issuecomment-1323246249

   Thanks for double-checking this 👍🏻 





[GitHub] [iceberg] Fokko closed issue #6204: Allow dropping a column used by old SortOrders

2022-11-21 Thread GitBox


Fokko closed issue #6204: Allow dropping a column used by old SortOrders
URL: https://github.com/apache/iceberg/issues/6204





[GitHub] [iceberg] ajantha-bhat commented on pull request #6240: Nessie: Refactor NessieTableOperations#doCommit

2022-11-21 Thread GitBox


ajantha-bhat commented on PR #6240:
URL: https://github.com/apache/iceberg/pull/6240#issuecomment-1323253673

   @nastra: Review from the Iceberg side is easy, but I would like to know: can 
Trino really use it?
   It looks to me like `client#commitTable` needs the parameters below (the last 
one renamed here from a duplicated `expectedContent` in the original snippet). 
   
   ```java
   commitTable(
     TableMetadata base,
     TableMetadata metadata,
     String newMetadataLocation,
     IcebergTable expectedContent,
     ContentKey key)
   ```
   
   But the Trino interface (`NessieIcebergTableOperations#commitNewTable`) 
currently doesn't have `base` or `expectedContent`, and these interfaces are 
common to the other catalogs too. 

