[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036804341

## data/src/test/java/org/apache/iceberg/data/orc/TestOrcRowIterator.java:

@@ -74,7 +75,7 @@ public void writeFile() throws IOException {
           .schema(DATA_SCHEMA)
           // write in such a way that the file contains 2 stripes each with 4 row groups of 1000
           // rows
-          .set("iceberg.orc.vectorbatch.size", "1000")
+          .set(TableProperties.ORC_WRITE_BATCH_SIZE, "1000")

Review Comment:
`iceberg.orc.vectorbatch.size` was deprecated for quite some time. I've restored all code that removes deprecated properties and the code that deals with legacy properties.
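For readers wanting to try the replacement key, here is a minimal sketch of how it plugs into an Iceberg ORC writer; the schema, output path, and generic writer function are illustrative assumptions, not taken from the test above:

```java
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.orc.GenericOrcWriter;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.orc.ORC;
import org.apache.iceberg.types.Types;

public class OrcBatchSizeExample {
  public static void main(String[] args) throws Exception {
    // hypothetical one-column schema, just to make the builder call complete
    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));

    // TableProperties.ORC_WRITE_BATCH_SIZE is the supported replacement for the
    // removed "iceberg.orc.vectorbatch.size" key
    try (FileAppender<Record> writer =
        ORC.write(Files.localOutput("/tmp/example.orc"))
            .schema(schema)
            .createWriterFunc(GenericOrcWriter::buildWriter)
            .set(TableProperties.ORC_WRITE_BATCH_SIZE, "1000")
            .build()) {
      // records would be appended here via writer.add(...)
    }
  }
}
```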
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036806161

## spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala:

@@ -228,13 +225,6 @@ case class RewriteMergeInto(spark: SparkSession) extends Rule[LogicalPlan] with
     }
   }

-  private def isCardinalityCheckEnabled(table: Table): Boolean = {
-    PropertyUtil.propertyAsBoolean(
-      table.properties(),
-      MERGE_CARDINALITY_CHECK_ENABLED,

Review Comment:
I've restored this and updated the comment on the property itself to mention that we will drop it once Spark 3.1 support is dropped.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036809785

## core/src/main/java/org/apache/iceberg/deletes/Deletes.java:

@@ -83,21 +83,6 @@ public static CloseableIterable markDeleted(
         });
   }

-  /**
-   * Returns the remaining rows (the ones that are not deleted), while counting the deleted ones.
-   *
-   * @param rows the rows to process
-   * @param isDeleted a predicate that determines if a row is deleted
-   * @return the processed rows
-   * @deprecated Will be removed in 1.2.0, use {@link Deletes#filterDeleted(CloseableIterable,
-   *     Predicate, DeleteCounter)}.
-   */
-  @Deprecated
-  public static CloseableIterable filterDeleted(

Review Comment:
This functionality was adjusted by #4588. We then re-introduced and deprecated the method in #6146 so as not to have breaking API changes. Also, there's currently no place in the code that uses this, so it seems better to remove it?
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036810161

## core/src/main/java/org/apache/iceberg/TableProperties.java:

@@ -342,19 +335,6 @@ private TableProperties() {}
   public static final String MERGE_MODE = "write.merge.mode";
   public static final String MERGE_MODE_DEFAULT = RowLevelOperationMode.COPY_ON_WRITE.modeName();

-  /**
-   * @deprecated will be removed in 0.14.0, the cardinality check is always performed starting from
-   *     0.13.0.
-   */
-  @Deprecated
-  public static final String MERGE_CARDINALITY_CHECK_ENABLED =
-      "write.merge.cardinality-check.enabled";
-  /**
-   * @deprecated will be removed in 0.14.0, the cardinality check is always performed starting from
-   *     0.13.0.
-   */
-  @Deprecated
-  public static final boolean MERGE_CARDINALITY_CHECK_ENABLED_DEFAULT = true;

Review Comment:
I've restored this property and mentioned that it can only be dropped once Spark 3.1 support is dropped.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036815990

## core/src/main/java/org/apache/iceberg/LocationProviders.java:

@@ -84,10 +84,7 @@ static class DefaultLocationProvider implements LocationProvider {
     private static String dataLocation(Map properties, String tableLocation) {
       String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
       if (dataLocation == null) {
-        dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
-        if (dataLocation == null) {
-          dataLocation = String.format("%s/data", tableLocation);
-        }
+        dataLocation = String.format("%s/data", tableLocation);

Review Comment:
I've restored this.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036815756

## core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:

@@ -758,20 +758,6 @@ protected Map summary() {
     return summaryBuilder.build();
   }

-  /**
-   * Apply the update's changes to the base table metadata and return the new manifest list.
-   *
-   * @param base the base table metadata to apply changes to
-   * @return a manifest list for the new snapshot.
-   * @deprecated Will be removed in 1.2.0, use {@link MergingSnapshotProducer#apply(TableMetadata,
-   *     Snapshot)}.
-   */
-  @Deprecated
-  @Override
-  public List apply(TableMetadata base) {

Review Comment:
This method was changed by #4926. We then restored and deprecated the old API in #6146 for one minor release and are removing it here now. Let me know if we want to keep those methods; then we can just drop the `@Deprecated` annotations from them.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036817726

## spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java:

@@ -144,16 +143,16 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
         });
     List rows = Lists.newArrayList(RandomData.generateSpark(schema, 10, 0L));
-    try (FileAppender writer =
-        new ParquetWriteAdapter<>(
-            new NativeSparkWriterBuilder(outputFile)
-                .set("org.apache.spark.sql.parquet.row.attributes", sparkSchema.json())
-                .set("spark.sql.parquet.writeLegacyFormat", "false")
-                .set("spark.sql.parquet.outputTimestampType", "INT96")
-                .set("spark.sql.parquet.fieldId.write.enabled", "true")
-                .build(),
-            MetricsConfig.getDefault())) {
-      writer.addAll(rows);
+    try (ParquetWriter writer =

Review Comment:
This was using the deprecated `ParquetWriteAdapter` class, so I started changing code that uses this class so that we can eventually remove `ParquetWriteAdapter`.
[GitHub] [iceberg] lichaohao opened a new issue, #6330: iceberg : format-version=2 , when the job is running (insert and update), can not execute rewrite small data file ?
lichaohao opened a new issue, #6330:
URL: https://github.com/apache/iceberg/issues/6330

### Query engine

iceberg: 1.0.0, spark: 3.2.0, flink: 1.13.2, catalog: hive-catalog

### Question

Source table: the MySQL CDC table `mysql_cdc_source`. Sink table: the Iceberg table `my_iceberg_sink`, with primary key `id`, `format-version=2`, and `write.upsert.enabled=true`.

Executed SQL (checkpoint interval 1 min):

upsert into my_iceberg_sink select * from mysql_cdc_source;

Note: MySQL receives both insert and update operations. After the job has been running for some time, I want to rewrite the Iceberg data files into bigger ones, so in Spark I execute:

call "hive_prod.system.rewrite_data_files(my_iceberg_sink)"

and get the following exception message:

can not commit, found new position delete for replaced data file: GenericDataFile hdfs://data.parquet

Given all of the above, what should I do so that `call hive_prod.system.rewrite_data_files(my_iceberg_sink)` executes correctly? Thank you for your answer!
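For reference, the documented form of the Spark procedure call takes the table name as a quoted string argument, which the call quoted in the issue does not; a minimal sketch (the `db` qualifier is a placeholder, not from the issue):

```sql
-- table is passed as a string literal; 'db' is a hypothetical database name
CALL hive_prod.system.rewrite_data_files(table => 'db.my_iceberg_sink');
```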
[GitHub] [iceberg] SHuixo commented on issue #6330: iceberg : format-version=2 , when the job is running (insert and update), can not execute rewrite small data file ?
SHuixo commented on issue #6330:
URL: https://github.com/apache/iceberg/issues/6330#issuecomment-1333486031

This is the same situation I ran into in #6104: in the current version, there is no effective way to compact historical data files containing delete operations while data is being written.
[GitHub] [iceberg] ConeyLiu opened a new pull request, #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ConeyLiu opened a new pull request, #6331:
URL: https://github.com/apache/iceberg/pull/6331

This PR just ported #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15.
[GitHub] [iceberg] chenjunjiedada commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
chenjunjiedada commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333509100

What about 1.16?
[GitHub] [iceberg] ajantha-bhat commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ajantha-bhat commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333580498

> What about 1.16?

I think flink-1.16 and spark-3.3 are handled in the original PR itself: https://github.com/apache/iceberg/pull/4627
[GitHub] [iceberg] ajantha-bhat commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ajantha-bhat commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333582482

nit: we could at least split it into one spark and one flink PR.
[GitHub] [iceberg] nastra commented on issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
nastra commented on issue #6318:
URL: https://github.com/apache/iceberg/issues/6318#issuecomment-1333600955

I've looked at the code, and this happened even before #5681. The decompressors are actually being re-used, just not across different Parquet files. So in your case lots of Parquet files are being read, and that is why you're seeing this log output from Hadoop's `CodecPool`.
[GitHub] [iceberg] camper42 closed issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
camper42 closed issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
URL: https://github.com/apache/iceberg/issues/6318
[GitHub] [iceberg] camper42 commented on issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
camper42 commented on issue #6318:
URL: https://github.com/apache/iceberg/issues/6318#issuecomment-1333612142

Thanks. So for my scenario, I may need to compact more promptly and/or set the log level of the `CodecPool` to `WARN`.
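For anyone hitting the same log noise: the messages come from the `org.apache.hadoop.io.compress.CodecPool` logger, so a per-logger override silences them. A sketch assuming a Log4j 1.x `log4j.properties` on the executors (adjust to Log4j 2 syntax if your Spark build uses that):

```properties
# CodecPool logs "Got brand-new decompressor" at INFO; only surface warnings and above
log4j.logger.org.apache.hadoop.io.compress.CodecPool=WARN
```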
[GitHub] [iceberg] ConeyLiu commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ConeyLiu commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333635843

Thanks @chenjunjiedada @ajantha-bhat for the review.

> What about 1.16?

It is already done in #4627.

> nit: we could at least split it into one spark and one flink PR or one PR for each version.

OK, let me split it into two patches.
[GitHub] [iceberg] chenjunjiedada commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
chenjunjiedada commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333640055

Well, I thought Flink 1.16 support was added only recently; I never realized the change was already in it.
[GitHub] [iceberg] loleek opened a new issue, #6332: Create tables error when using JDBC catalog and mysql backend
loleek opened a new issue, #6332:
URL: https://github.com/apache/iceberg/issues/6332

### Apache Iceberg version

1.1.0 (latest release)

### Query engine

_No response_

### Please describe the bug 🐞

When I use the JDBC catalog with a MySQL backend in a Flink test case, the create catalog operation returns an error.

> Flink SQL> CREATE CATALOG iceberg_catalog WITH (
>   'type'='iceberg',
>   'catalog-impl'='org.apache.iceberg.jdbc.JdbcCatalog',
>   'catalog-name' = 'iceberg_catalog',
>   'warehouse' = 'file:///path/to/warehouse',
>   'uri' = 'jdbc:mysql://host:port/iceberg_catalog',
>   'jdbc.user' = 'root',
>   'jdbc.password' = 'passwd');

[ERROR] Could not execute SQL statement. Reason: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes.

I think the root cause is here: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java#L179, which uses CATALOG_NAME VARCHAR(255) + NAMESPACE_NAME VARCHAR(255) + NAMESPACE_PROPERTY_KEY VARCHAR(5500) as the primary key. Here is a related issue that fixed the CREATE_CATALOG_TABLE statement: https://github.com/apache/iceberg/pull/2778/files
[GitHub] [iceberg] ConeyLiu opened a new pull request, #6333: Port #4627 to Flink 1.14/1.15
ConeyLiu opened a new pull request, #6333:
URL: https://github.com/apache/iceberg/pull/6333

This PR just ported #4627 to Flink 1.14/1.15.
[GitHub] [iceberg] nastra commented on issue #6332: Create tables error when using JDBC catalog and mysql backend
nastra commented on issue #6332:
URL: https://github.com/apache/iceberg/issues/6332#issuecomment-1333663036

The `max key length limit=767` is actually a limit that is imposed by the underlying database being used (MySQL in this case). A valid approach would be to increase that limit on MySQL. Related issue: https://github.com/apache/iceberg/issues/6236
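As a hedged illustration of where the 767-byte figure comes from: InnoDB caps index key prefixes at 767 bytes with the older `COMPACT`/`REDUNDANT` row formats and at 3072 bytes with `DYNAMIC`/`COMPRESSED` (the MySQL 8.0 default), so checking the server's row-format settings is a reasonable first step; the catalog table name below is assumed, not verified against the reporter's setup:

```sql
-- settings that govern the maximum index key length (names vary by MySQL version)
SHOW VARIABLES LIKE 'innodb_large_prefix';
SHOW VARIABLES LIKE 'innodb_default_row_format';

-- row format of the catalog's namespace-properties table (name assumed)
SHOW TABLE STATUS LIKE 'iceberg_namespace_properties';
```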
[GitHub] [iceberg] nastra closed issue #6273: java.sql.SQLException: Access denied for user 'hive'@'hadoopSlave0' (using password: YES)
nastra closed issue #6273: java.sql.SQLException: Access denied for user 'hive'@'hadoopSlave0' (using password: YES)
URL: https://github.com/apache/iceberg/issues/6273
[GitHub] [iceberg] nastra commented on issue #6273: java.sql.SQLException: Access denied for user 'hive'@'hadoopSlave0' (using password: YES)
nastra commented on issue #6273:
URL: https://github.com/apache/iceberg/issues/6273#issuecomment-1333664922

I'll close this for now. Feel free to re-open if necessary.
[GitHub] [iceberg] nastra commented on issue #4092: Not an iceberg table Error use hive catalog-type
nastra commented on issue #4092:
URL: https://github.com/apache/iceberg/issues/4092#issuecomment-1333668098

Closing this for now. Feel free to re-open if necessary.
[GitHub] [iceberg] nastra closed issue #4092: Not an iceberg table Error use hive catalog-type
nastra closed issue #4092: Not an iceberg table Error use hive catalog-type
URL: https://github.com/apache/iceberg/issues/4092
[GitHub] [iceberg] nastra commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
nastra commented on issue #6236:
URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1333670878

@yuangjiang you can probably take a look at https://dev.mysql.com/doc/refman/8.0/en/innodb-limits.html
[GitHub] [iceberg] nastra commented on issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29
nastra commented on issue #5027:
URL: https://github.com/apache/iceberg/issues/5027#issuecomment-1333676057

@noneback are you able to increase the limit on MySQL? Different databases usually impose different limits, but I'll see if we can lower the namespace properties column from 5500.
[GitHub] [iceberg] Fokko opened a new pull request, #6334: Python: Add warning on projection by name
Fokko opened a new pull request, #6334:
URL: https://github.com/apache/iceberg/pull/6334

```python
➜  python git:(master) ✗ python3
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.catalog import load_catalog
>>> catalog = load_catalog('local')
>>> table = catalog.load_table(('nyc', 'taxis'))
>>> table.scan().to_arrow()
/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/table/__init__.py:340: UserWarning: Projection is currently done by name instead of column ID, this can lead to incorrect results in some cases.
  warnings.warn(
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us, tz=+00:00]
tpep_dropoff_datetime: timestamp[us, tz=+00:00]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
```
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5967: Flink: Support read options in flink source
hililiwei commented on code in PR #5967: URL: https://github.com/apache/iceberg/pull/5967#discussion_r1037105073 ## docs/flink-getting-started.md: ## @@ -683,7 +683,47 @@ env.execute("Test Iceberg DataStream"); OVERWRITE and UPSERT can't be set together. In UPSERT mode, if the table is partitioned, the partition fields should be included in equality fields. {{< /hint >}} -## Write options +## Options +### Read options + +Flink read options are passed when configuring the FlinkSink, like this: + +``` +IcebergSource.forRowData() +.tableLoader(tableResource.tableLoader()) +.assignerFactory(new SimpleSplitAssignerFactory()) +.streaming(scanContext.isStreaming()) +.streamingStartingStrategy(scanContext.streamingStartingStrategy()) +.startSnapshotTimestamp(scanContext.startSnapshotTimestamp()) +.startSnapshotId(scanContext.startSnapshotId()) +.set("monitor-interval", "10s") +.build() +``` +For Flink SQL, read options can be passed in via SQL hints like this: +``` +SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='10s') */ +... +``` + +| Flink option| Default| Description | +| --- | -- | | +| snapshot-id || For time travel in batch mode. Read data from the specified snapshot-id. | +| case-sensitive | false | Whether the sql is case sensitive| +| as-of-timestamp || For time travel in batch mode. Read data from the most recent snapshot as of the given time in milliseconds. | +| starting-strategy | INCREMENTAL_FROM_LATEST_SNAPSHOT | Starting strategy for streaming execution. TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode. The incremental mode starts from the current snapshot exclusive. INCREMENTAL_FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive. If it is an empty map, all future append snapshots should be discovered. INCREMENTAL_FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive. If it is an empty map, all future append snapshots should be discovered. INCREMENTAL_FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive. INCREMENTAL_FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive. If the timestamp is between two snapshots, it should start from the snapshot after the timestamp. Just for FIP27 Source | +| start-snapshot-timestamp|| Start to read data from the most recent snapshot as of the given time in milliseconds. | +| start-snapshot-id || Start to read data from the specified snapshot-id. | +| end-snapshot-id | The latest snapshot id | Specifies the end snapshot. | +| split-size | Table read.split.target-size | Overrides this table's read.split.target-size| +| split-lookback | Table read.split.planning-lookback | Overrides this table's read.split.planning-lookback | +| split-file-open-cost| Table read.split.open-file-cost| Overrides this table's read.split.open-file-cost | +| streaming | false | Sets whether the current task runs in streaming or batch mode. | Review Comment: Actually, I think this parameter is a little redundant, why don't we use `execute.runtime-mode`? If we're going to make some breaking change, I'm inclined to deprecated it first and then delete it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
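For context on the suggestion above: `execution.runtime-mode` is Flink's own job-level switch between streaming and batch execution, settable in the configuration or in code; a minimal sketch using the standard Flink API, independent of Iceberg:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeExample {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // equivalent to execution.runtime-mode=BATCH in flink-conf.yaml or on the command line
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
  }
}
```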
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5967: Flink: Support read options in flink source
hililiwei commented on code in PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#discussion_r1037109651

## docs/flink-getting-started.md (same hunk as in the previous comment; the thread is on the `streaming` option row):

+| streaming | false | Sets whether the current task runs in streaming or batch mode. |

Review Comment:
> it has real Flink batch-read usage in production, especially before Flink 1.16 version.

Even version 1.16, which we assessed, still seems inadequate for batch; our business people have no incentive to migrate from Spark to it. But I also tend not to change it.
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5967: Flink: Support read options in flink source
hililiwei commented on code in PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#discussion_r1037113844

## docs/flink-getting-started.md (same hunk as above; the thread is on the `max-planning-snapshot-count` row):

+| monitor-interval | 10s | Interval for listening on the generation of new snapshots. |
+| include-column-stats | false | Create a new scan from this that loads the column stats with each data file. Column stats include: value count, null value count, lower bounds, and upper bounds. |
+| max-planning-snapshot-count | Integer.MAX_VALUE | If there are multiple new snapshots, configures the maximum number of snapshots to advance at a time. |

Review Comment:
I'm conflicted here. If we make it smaller, will it take a long time to catch up to the latest snapshot when the data backlog is severe (such as after a restart)?
[GitHub] [iceberg] hililiwei commented on pull request #5967: Flink: Support read options in flink source
hililiwei commented on PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#issuecomment-1333783891

> > Hmm, I understand your concern now. The job-level configuration should have a connector prefix, which makes sense to me. Shall we consider using the same prefix?
>
> Yeah. I was worried about config naming collisions. A common prefix like `connector.iceberg.` should be enough. But it will be a problem to change it for the write configs, which are already released.

I'm in favor of adding the prefix for the new options and leaving the old ones as-is for compatibility.
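To make the naming discussion concrete, here is a sketch of how a prefixed job-level option could coexist with the unprefixed per-query hint; the `connector.iceberg.` prefix was still a proposal at this point, so the first statement is hypothetical:

```sql
-- hypothetical job-level default with the proposed connector prefix (Flink SQL client)
SET 'connector.iceberg.monitor-interval' = '60s';

-- per-query override via SQL hint keeps the short key
SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='60s') */;
```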
[GitHub] [iceberg] robinsinghstudios commented on issue #5977: How to write to a bucket-partitioned table using PySpark?
robinsinghstudios commented on issue #5977:
URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1333788692

Hi, my data is being appended to the partitioned table without registering any UDFs, but it seems to be written as one row per file, which is creating a huge performance impact. Any suggestions or fixes would be really appreciated.
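One table-level knob that often helps with one-row-per-file symptoms is Iceberg's write distribution mode, which clusters incoming rows by partition before writing; a sketch via Spark SQL, with the table name as a placeholder and no guarantee it resolves this particular case:

```sql
-- 'db.tbl' is a placeholder; hash-distributes rows by the partition spec on write
ALTER TABLE db.tbl SET TBLPROPERTIES ('write.distribution-mode' = 'hash');
```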
[GitHub] [iceberg] ConeyLiu opened a new pull request, #6335: Core: Avoid generating a large ManifestFile when committing
ConeyLiu opened a new pull request, #6335:
URL: https://github.com/apache/iceberg/pull/6335

In our production environment, we noticed that manifest files have widely varying sizes, ranging from several KB to more than 100 MB. It seems `MANIFEST_TARGET_SIZE_BYTES` is not honored during the commit phase. In this PR, we avoid generating a manifest file larger than `MANIFEST_TARGET_SIZE_BYTES` for newly added content files, and generate multiple manifest files when the target size is exceeded.
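A minimal sketch of the roll-over idea the PR describes, written against the public manifest-writer API; the helper shape, the supply of output files, and the threshold wiring are assumptions for illustration, not the PR's actual code:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestWriter;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

class RollingManifestSketch {
  // hypothetical helper: spreads newly added files over several manifests,
  // starting a fresh writer once the current one passes the target size
  static List<ManifestFile> writeRolling(
      PartitionSpec spec, Iterable<DataFile> added, Iterator<OutputFile> outputs, long targetSize)
      throws IOException {
    List<ManifestFile> manifests = Lists.newArrayList();
    ManifestWriter<DataFile> writer = ManifestFiles.write(spec, outputs.next());
    for (DataFile file : added) {
      writer.add(file);
      // length() may lag slightly behind the final size while the writer is open
      if (writer.length() >= targetSize) {
        writer.close();
        manifests.add(writer.toManifestFile());
        writer = ManifestFiles.write(spec, outputs.next());
      }
    }
    writer.close();
    manifests.add(writer.toManifestFile());
    return manifests;
  }
}
```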
[GitHub] [iceberg] hililiwei commented on pull request #5967: Flink: Support read options in flink source
hililiwei commented on PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#issuecomment-1333793436

> 2. some of the read configs don't make sense to set in table environment configs, as they are tied to a specific source/table. How should we handle this situation?
>
> ```
> public static final ConfigOption SNAPSHOT_ID =
>     ConfigOptions.key("snapshot-id").longType().defaultValue(null);
>
> public static final ConfigOption START_SNAPSHOT_ID =
>     ConfigOptions.key("start-snapshot-id").longType().defaultValue(null);
>
> public static final ConfigOption END_SNAPSHOT_ID =
>     ConfigOptions.key("end-snapshot-id").longType().defaultValue(null);
> ```

They will not be set in the table environment. For example:

```
public Long startSnapshotId() {
  return confParser.longConf().option(FlinkReadOptions.START_SNAPSHOT_ID.key()).parseOptional();
}
```

This only works at the job level.
[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1333794619

Hi @szehon-ho @rdblue @pvary @stevenzwu @chenjunjiedada, please help review this when you are free. Thanks a lot.
[GitHub] [iceberg] ConeyLiu commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2
ConeyLiu commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333796211

cc @szehon-ho @pvary @kbendick, who reviewed the original PR.
[GitHub] [iceberg] ConeyLiu commented on pull request #6333: Port #4627 to Flink 1.14/1.15
ConeyLiu commented on PR #6333:
URL: https://github.com/apache/iceberg/pull/6333#issuecomment-1333796294

cc @szehon-ho @pvary @kbendick, who reviewed the original PR.
[GitHub] [iceberg] hililiwei opened a new pull request, #6336: Doc: Replace build with append in the Flink sink doc
hililiwei opened a new pull request, #6336:
URL: https://github.com/apache/iceberg/pull/6336

`build()` has been removed.
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222:
URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037191643

## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/StructRowData.java:

@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.data;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.time.Duration;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.OffsetDateTime;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.flink.table.data.ArrayData;
+import org.apache.flink.table.data.DecimalData;
+import org.apache.flink.table.data.GenericArrayData;
+import org.apache.flink.table.data.GenericMapData;
+import org.apache.flink.table.data.MapData;
+import org.apache.flink.table.data.RawValueData;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.data.StringData;
+import org.apache.flink.table.data.TimestampData;
+import org.apache.flink.types.RowKind;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.flink.FlinkSchemaUtil;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.ByteBuffers;
+
+public class StructRowData implements RowData {
+  private final Types.StructType type;
+  private RowKind kind;
+  private StructLike struct;
+
+  public StructRowData(Types.StructType type) {
+    this(type, RowKind.INSERT);
+  }
+
+  public StructRowData(Types.StructType type, RowKind kind) {
+    this(type, null, kind);
+  }
+
+  private StructRowData(Types.StructType type, StructLike struct) {
+    this(type, struct, RowKind.INSERT);
+  }
+
+  private StructRowData(Types.StructType type, StructLike struct, RowKind kind) {
+    this.type = type;
+    this.struct = struct;
+    this.kind = kind;
+  }
+
+  public StructRowData setStruct(StructLike newStruct) {
+    this.struct = newStruct;
+    return this;
+  }
+
+  @Override
+  public int getArity() {
+    return struct.size();
+  }
+
+  @Override
+  public RowKind getRowKind() {
+    return kind;
+  }
+
+  @Override
+  public void setRowKind(RowKind newKind) {
+    Preconditions.checkNotNull(newKind, "kind can not be null");
+    this.kind = newKind;
+  }
+
+  @Override
+  public boolean isNullAt(int pos) {
+    return struct.get(pos, Object.class) == null;
+  }
+
+  @Override
+  public boolean getBoolean(int pos) {
+    return struct.get(pos, Boolean.class);
+  }
+
+  @Override
+  public byte getByte(int pos) {
+    return (byte) (int) struct.get(pos, Integer.class);

Review Comment:
All types here are derived from https://github.com/apache/iceberg/blob/49a0ea956e3a4b9979754c887d803d4ab51131ae/api/src/main/java/org/apache/iceberg/types/Types.java#L40-L54
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222:
URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037191643

## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/StructRowData.java (same hunk as in the previous comment, ending at `getByte`):

Review Comment:
All types here are derived from https://github.com/apache/iceberg/blob/49a0ea956e3a4b9979754c887d803d4ab51131ae/api/src/main/java/org/apache/iceberg/types/Types.java#L40-L54

It doesn't have `byte` and `short`.
[GitHub] [iceberg] InvisibleProgrammer opened a new pull request, #6337: Update Iceberg Hive documentation
InvisibleProgrammer opened a new pull request, #6337: URL: https://github.com/apache/iceberg/pull/6337 Issue: https://github.com/apache/iceberg/issues/6249 This documentation update covers the new features introduced in Hive 4.0.0-alpha-2. The only exception is https://issues.apache.org/jira/browse/HIVE-26077; it seems it was already supported in alpha-1, so I just refreshed the related part of the documentation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222: URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037252208 ## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/StructRowData.java: ## @@ -0,0 +1,340 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.data; + +import java.math.BigDecimal; +import java.nio.ByteBuffer; +import java.time.Duration; +import java.time.Instant; +import java.time.LocalDate; +import java.time.OffsetDateTime; +import java.util.Collection; +import java.util.List; +import java.util.Map; +import java.util.function.BiConsumer; +import java.util.function.Function; +import org.apache.flink.table.data.ArrayData; +import org.apache.flink.table.data.DecimalData; +import org.apache.flink.table.data.GenericArrayData; +import org.apache.flink.table.data.GenericMapData; +import org.apache.flink.table.data.MapData; +import org.apache.flink.table.data.RawValueData; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.data.StringData; +import org.apache.flink.table.data.TimestampData; +import org.apache.flink.types.RowKind; +import org.apache.iceberg.StructLike; +import org.apache.iceberg.flink.FlinkSchemaUtil; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.types.Type; +import org.apache.iceberg.types.Types; +import org.apache.iceberg.util.ByteBuffers; + +public class StructRowData implements RowData { + private final Types.StructType type; + private RowKind kind; + private StructLike struct; + + public StructRowData(Types.StructType type) { +this(type, RowKind.INSERT); + } + + public StructRowData(Types.StructType type, RowKind kind) { +this(type, null, kind); + } + + private StructRowData(Types.StructType type, StructLike struct) { Review Comment: Yes. It is used to handle the STRUCT/ROW type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
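For illustration, a minimal usage sketch of this wrapper, based on the public constructor and `setStruct` visible in the diff; using the iceberg-data `GenericRecord` as the backing `StructLike` is just one convenient choice, not something the PR prescribes.

```java
import org.apache.flink.types.RowKind;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.flink.data.StructRowData;
import org.apache.iceberg.types.Types;

public class StructRowDataExample {
  public static void main(String[] args) {
    // A one-field struct type and a concrete StructLike value for it.
    Types.StructType structType =
        Types.StructType.of(Types.NestedField.required(1, "id", Types.LongType.get()));
    GenericRecord record = GenericRecord.create(structType);
    record.setField("id", 42L);

    // One StructRowData instance can be reused across rows by swapping the backing struct.
    StructRowData row = new StructRowData(structType, RowKind.INSERT).setStruct(record);
    System.out.println(row.getLong(0)); // prints: 42
  }
}
```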
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222: URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037253800 ## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/FlinkCatalog.java: ## @@ -148,6 +150,17 @@ private Namespace toNamespace(String database) { } TableIdentifier toIdentifier(ObjectPath path) { +String objectName = path.getObjectName(); +List tableName = Splitter.on('$').splitToList(objectName); +if (tableName.size() > 1 && MetadataTableType.from(tableName.get(1)) != null) { Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
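A sketch of the resolution this enables, following the `$` convention from the diff (the table name is made up; the diff's check relies on `MetadataTableType.from` returning null for unknown suffixes):

```java
import java.util.List;
import org.apache.iceberg.MetadataTableType;
import org.apache.iceberg.relocated.com.google.common.base.Splitter;

public class MetadataTableNameExample {
  public static void main(String[] args) {
    // "mytable$snapshots" resolves to base table "mytable" plus metadata table "snapshots".
    List<String> parts = Splitter.on('$').splitToList("mytable$snapshots");
    if (parts.size() > 1 && MetadataTableType.from(parts.get(1)) != null) {
      System.out.println("base table:     " + parts.get(0));
      System.out.println("metadata table: " + MetadataTableType.from(parts.get(1)));
    }
  }
}
```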
[GitHub] [iceberg] RussellSpitzer commented on issue #6326: estimateStatistics cost mush time to compute stats
RussellSpitzer commented on issue #6326: URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1333946715 While I don't have a problem with disabling statistics reporting, I am pretty dubious this takes that long. What I believe you are actually seeing is the task list being created for the first time and stored in a list. We use a lazy iterator, which needs to be turned into a list before the job begins (even if statistics are not reported). This means that even if we don't spend the time iterating the list when we are estimating stats, we will spend that same amount of time later when planning tasks. The only difference is that in the current case the second access to "tasks()" is cached, so it's very fast. In this case the speed could probably be improved if the parallelism of the manifest reads were increased. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
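An illustrative sketch of the caching behavior described above, with a memoized supplier standing in for the real `tasks()` implementation (class and method names here are made up, not the actual Spark scan code):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.iceberg.relocated.com.google.common.base.Supplier;
import org.apache.iceberg.relocated.com.google.common.base.Suppliers;

class ScanTasksSketch {
  // The first access pays the full manifest-read cost; later accesses hit the cache.
  private final Supplier<List<String>> tasks = Suppliers.memoize(this::planTasks);

  private List<String> planTasks() {
    // Stands in for materializing the lazy iterable of FileScanTasks into a list.
    return Arrays.asList("task-1", "task-2");
  }

  long estimateStatistics() {
    return tasks.get().size(); // triggers planning on the first call
  }

  List<String> planInputPartitions() {
    return tasks.get(); // cached, so the second access is very fast
  }
}
```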
[GitHub] [iceberg] nastra opened a new pull request, #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra opened a new pull request, #6338: URL: https://github.com/apache/iceberg/pull/6338 Users are running into issues when hooking up the `JdbcCatalog` with `MySQL` or other databases, which impose lower limits than [sqlite](https://www.sqlite.org/limits.html) (which we use for testing the `JdbcCatalog`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on a diff in pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra commented on code in PR #6338: URL: https://github.com/apache/iceberg/pull/6338#discussion_r1037273853 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -185,9 +185,9 @@ final class JdbcUtil { + NAMESPACE_NAME + " VARCHAR(255) NOT NULL," + NAMESPACE_PROPERTY_KEY - + " VARCHAR(5500)," + + " VARCHAR(255)," Review Comment: CATALOG_NAME + NAMESPACE_NAME + NAMESPACE_PROPERTY_KEY make up the primary key, and the primary key size is limited to 767 bytes (255 * 3 = 765 < 767). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
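For reference, a sketch of the DDL shape this produces; the column names follow the constants visible in the diff, while the property-value length is an assumption, so treat this as approximate rather than the exact statement JdbcUtil emits:

```java
class JdbcDdlSketch {
  static final String CREATE_NAMESPACE_PROPERTIES_TABLE =
      "CREATE TABLE iceberg_namespace_properties ("
          + "catalog_name VARCHAR(255) NOT NULL, "
          + "namespace VARCHAR(255) NOT NULL, "
          + "property_key VARCHAR(255), "
          + "property_value VARCHAR(1000), " // length assumed, not from the diff
          // three 255-char key columns: 255 * 3 = 765 < 767
          + "PRIMARY KEY (catalog_name, namespace, property_key))";
}
```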
[GitHub] [iceberg] nastra commented on a diff in pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra commented on code in PR #6338: URL: https://github.com/apache/iceberg/pull/6338#discussion_r1037276526 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -72,9 +72,9 @@ final class JdbcUtil { + TABLE_NAME + " VARCHAR(255) NOT NULL," + METADATA_LOCATION - + " VARCHAR(5500)," + + " VARCHAR(1000)," Review Comment: arbitrarily chosen to be lower than 3072 bytes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra commented on PR #6338: URL: https://github.com/apache/iceberg/pull/6338#issuecomment-1333965778 @openinx given that you reviewed https://github.com/apache/iceberg/pull/2778 back then, could you review this one as well please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #6332: Create tables error when using JDBC catalog and mysql backend
nastra commented on issue #6332: URL: https://github.com/apache/iceberg/issues/6332#issuecomment-1333969634 I have created https://github.com/apache/iceberg/pull/6338 to make this less of an issue with MySQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29
nastra commented on issue #5027: URL: https://github.com/apache/iceberg/issues/5027#issuecomment-1333969840 I have created https://github.com/apache/iceberg/pull/6338 to make this less of an issue with MySQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
nastra commented on issue #6236: URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1333970109 I have created https://github.com/apache/iceberg/pull/6338 to make this less of an issue with MySQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #5977: How to write to a bucket-partitioned table using PySpark?
RussellSpitzer commented on issue #5977: URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1333976156 @robinsinghstudios the data needs to be globally sorted on the bucketing function; at least, that is my guess given those symptoms. Luckily, this should all be less of an issue once the function catalog work in Spark 3.3 is sorted out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
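Until then, one workaround is to register the bucket transform as a UDF and sort on it before writing; here is a sketch assuming Iceberg's `IcebergSpark.registerBucketUDF` helper, with made-up table and column names:

```java
import static org.apache.spark.sql.functions.expr;

import org.apache.iceberg.spark.IcebergSpark;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class BucketedWriteExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Expose Iceberg's bucket transform as a Spark UDF: 16 buckets over a long column.
    IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16);

    // Sort globally on the bucket value so each task writes to few partitions.
    Dataset<Row> df = spark.table("source");
    df.orderBy(expr("iceberg_bucket16(id)"))
        .writeTo("my_catalog.db.bucketed_table")
        .append();
  }
}
```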
[GitHub] [iceberg] aokolnychyi opened a new pull request, #6339: Spark 3.3: Add hours transform function
aokolnychyi opened a new pull request, #6339: URL: https://github.com/apache/iceberg/pull/6339 This PR adds `hours` transform function in Spark 3.3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6339: Spark 3.3: Add hours transform function
aokolnychyi commented on PR #6339: URL: https://github.com/apache/iceberg/pull/6339#issuecomment-1334029975 cc @kbendick @rdblue @RussellSpitzer @szehon-ho @flyrain @gaborkaszab @nastra -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
danielcweeks merged PR #6338: URL: https://github.com/apache/iceberg/pull/6338 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks closed issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29
danielcweeks closed issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29 URL: https://github.com/apache/iceberg/issues/5027 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks closed issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
danielcweeks closed issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes URL: https://github.com/apache/iceberg/issues/6236 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks closed issue #6332: Create tables error when using JDBC catalog and mysql backend
danielcweeks closed issue #6332: Create tables error when using JDBC catalog and mysql backend URL: https://github.com/apache/iceberg/issues/6332 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6322: Core: Fix NPE in CloseableIterable.close()
danielcweeks merged PR #6322: URL: https://github.com/apache/iceberg/pull/6322 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6317: Core: MetadataUpdateParser should write updates/removals fields rather than updated/removed
danielcweeks merged PR #6317: URL: https://github.com/apache/iceberg/pull/6317 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] alesk opened a new issue, #6340: Deleting column from an iceberg table breaks schema in AWS Glue catalog
alesk opened a new issue, #6340: URL: https://github.com/apache/iceberg/issues/6340 ### Apache Iceberg version 1.1.0 (latest release) ### Query engine Spark ### Please describe the bug 🐞 When deleting a column with Spark SQL using `ALTER TABLE table_name DROP COLUMN column_name`, the following happens: 1) The column gets removed from the current schema in the table's metadata on S3. 2) A new version of the schema is created in AWS Glue, but the column is moved to the last position in the table instead of being removed. The two schemas have diverged. Querying with Spark works as expected, but Athena reports the wrong schema, breaking tools such as Metabase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #6340: Deleting a column from an iceberg table breaks schema in AWS Glue catalog
nastra commented on issue #6340: URL: https://github.com/apache/iceberg/issues/6340#issuecomment-1334099047 @amogh-jahagirdar could you please take a look at this one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
rdblue commented on code in PR #6338: URL: https://github.com/apache/iceberg/pull/6338#discussion_r1037388592 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -72,9 +72,9 @@ final class JdbcUtil { + TABLE_NAME + " VARCHAR(255) NOT NULL," + METADATA_LOCATION - + " VARCHAR(5500)," + + " VARCHAR(1000)," Review Comment: Why not make this as long as we can to avoid truncation? What happens if a value is truncated? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6289: Spark 2.4: Preserve file seq numbers while rewriting manifests
aokolnychyi merged PR #6289: URL: https://github.com/apache/iceberg/pull/6289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6289: Spark 2.4: Preserve file seq numbers while rewriting manifests
aokolnychyi commented on PR #6289: URL: https://github.com/apache/iceberg/pull/6289#issuecomment-1334119345 Thanks, @nastra! Sorry for the delay. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
aokolnychyi commented on code in PR #6274: URL: https://github.com/apache/iceberg/pull/6274#discussion_r1037396694 ## spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java: ## @@ -356,7 +356,7 @@ private static ManifestFile writeManifest( long snapshotId = row.getLong(0); long sequenceNumber = row.getLong(1); Row file = row.getStruct(2); -writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber); +writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber, null); Review Comment: Yep, thanks. I merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6274: Core|ORC|Spark: Remove deprecated functionality
aokolnychyi commented on PR #6274: URL: https://github.com/apache/iceberg/pull/6274#issuecomment-1334121747 I'll take a look at seq number related changes today. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on code in PR #6309: URL: https://github.com/apache/iceberg/pull/6309#discussion_r1037409951 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java: ## @@ -63,7 +64,7 @@ class SparkBatchQueryScan extends SparkScan implements SupportsRuntimeFiltering private static final Logger LOG = LoggerFactory.getLogger(SparkBatchQueryScan.class); - private final TableScan scan; + private final Scan scan; Review Comment: Changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on code in PR #6309: URL: https://github.com/apache/iceberg/pull/6309#discussion_r1037410284 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java: ## @@ -213,40 +215,53 @@ public Scan build() { Schema expectedSchema = schemaWithMetadataColumns(); -TableScan scan = -table -.newScan() -.caseSensitive(caseSensitive) -.filter(filterExpression()) -.project(expectedSchema); +if (startSnapshotId == null) { + BatchScan scan = Review Comment: Went ahead and added separate methods in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on PR #6309: URL: https://github.com/apache/iceberg/pull/6309#issuecomment-1334137970 I tried adding boundaries but it was mostly useless as we need to support arbitrary types. The current approach seems most reasonable at the moment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #6328: Python: Set version to 0.2.0
rdblue merged PR #6328: URL: https://github.com/apache/iceberg/pull/6328 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #6334: Python: Add warning on projection by name
rdblue merged PR #6334: URL: https://github.com/apache/iceberg/pull/6334 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274: URL: https://github.com/apache/iceberg/pull/6274#discussion_r1037420331 ## spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java: ## @@ -356,7 +356,7 @@ private static ManifestFile writeManifest( long snapshotId = row.getLong(0); long sequenceNumber = row.getLong(1); Row file = row.getStruct(2); -writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber); +writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber, null); Review Comment: great, thanks. I'll rebase this PR tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
RussellSpitzer commented on code in PR #6309: URL: https://github.com/apache/iceberg/pull/6309#discussion_r1037424678 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java: ## @@ -211,11 +213,19 @@ public Scan build() { SparkReadOptions.END_SNAPSHOT_ID, SparkReadOptions.START_SNAPSHOT_ID); +if (startSnapshotId != null) { + return buildIncrementalAppendScan(startSnapshotId, endSnapshotId); Review Comment: As a code reviewer, I do think this is cleaner 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on pull request #6314: Core: Re-add and deprecate HMS_TABLE_OWNER to TableProperties
szehon-ho commented on PR #6314: URL: https://github.com/apache/iceberg/pull/6314#issuecomment-1334176715 Thanks a lot for the fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6339: Spark 3.3: Add hours transform function
aokolnychyi merged PR #6339: URL: https://github.com/apache/iceberg/pull/6339 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6339: Spark 3.3: Add hours transform function
aokolnychyi commented on PR #6339: URL: https://github.com/apache/iceberg/pull/6339#issuecomment-1334220935 Thanks, @RussellSpitzer @nastra! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037486854 ## core/src/main/java/org/apache/iceberg/MetricsUtil.java: ## @@ -56,4 +72,270 @@ public static MetricsModes.MetricsMode metricsMode( String columnName = inputSchema.findColumnName(fieldId); return metricsConfig.columnMode(columnName); } + + public static final List READABLE_COL_METRICS = + ImmutableList.of( + new ReadableMetricCol("column_size", f -> Types.LongType.get(), "Total size on disk"), + new ReadableMetricCol( + "value_count", f -> Types.LongType.get(), "Total count, including null and NaN"), + new ReadableMetricCol("null_value_count", f -> Types.LongType.get(), "Null value count"), + new ReadableMetricCol("nan_value_count", f -> Types.LongType.get(), "NaN value count"), + new ReadableMetricCol("lower_bound", Types.NestedField::type, "Lower bound"), + new ReadableMetricCol("upper_bound", Types.NestedField::type, "Upper bound")); + + public static final String READABLE_METRICS = "readable_metrics"; + + public static class ReadableMetricCol { +private final String name; +private final Function typeFunction; +private final String doc; + +ReadableMetricCol(String name, Function typeFunction, String doc) { + this.name = name; + this.typeFunction = typeFunction; + this.doc = doc; +} + +String name() { + return name; +} + +Type type(Types.NestedField field) { + return typeFunction.apply(field); +} + +String doc() { + return doc; +} + } + + /** + * Represents a struct of metrics for a primitive column + * + * @param primitive column type + */ + public static class ReadableColMetricsStruct implements StructLike { + +private final String columnName; +private final Long columnSize; +private final Long valueCount; +private final Long nullValueCount; +private final Long nanValueCount; +private final T lowerBound; +private final T upperBound; +private final Map projectionMap; + +public ReadableColMetricsStruct( +String columnName, +Long columnSize, +Long valueCount, +Long nullValueCount, +Long nanValueCount, +T lowerBound, +T upperBound, +Types.NestedField projection) { + this.columnName = columnName; + this.columnSize = columnSize; + this.valueCount = valueCount; + this.nullValueCount = nullValueCount; + this.nanValueCount = nanValueCount; + this.lowerBound = lowerBound; + this.upperBound = upperBound; + this.projectionMap = readableMetricsProjection(projection); +} + +@Override +public int size() { + return projectionMap.size(); +} + +@Override +public T get(int pos, Class javaClass) { + Object value = get(pos); + return value == null ? 
null : javaClass.cast(value); +} + +@Override +public void set(int pos, T value) { + throw new UnsupportedOperationException("ReadableMetricsStruct is read only"); +} + +private Object get(int pos) { + int projectedPos = projectionMap.get(pos); + switch (projectedPos) { +case 0: + return columnSize; +case 1: + return valueCount; +case 2: + return nullValueCount; +case 3: + return nanValueCount; +case 4: + return lowerBound; +case 5: + return upperBound; +default: + throw new IllegalArgumentException( + String.format("Invalid projected pos %d", projectedPos)); + } +} + +/** @return map of projected position to actual position of this struct's fields */ +private Map readableMetricsProjection(Types.NestedField projection) { + Map result = Maps.newHashMap(); + + Set projectedFields = + Sets.newHashSet( + projection.type().asStructType().fields().stream() + .map(Types.NestedField::name) + .collect(Collectors.toSet())); + + int projectedIndex = 0; + for (int fieldIndex = 0; fieldIndex < READABLE_COL_METRICS.size(); fieldIndex++) { +ReadableMetricCol readableMetric = READABLE_COL_METRICS.get(fieldIndex); + +if (projectedFields.contains(readableMetric.name())) { + result.put(projectedIndex, fieldIndex); + projectedIndex++; +} + } + return result; +} + +String columnName() { + return columnName; +} + } + + /** + * Represents a struct, consisting of all {@link ReadableColMetricsStruct} for all primitive + * columns of the table + */ + public static class ReadableMetricsStruct implements StructLike { + +private final List columnMetrics; + +public ReadableMetricsStruct(List columnMetrics) { + this.columnMetrics = columnMetrics; +} + +@Override
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037488519 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,271 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.util.Arrays; +import java.util.UUID; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.expressions.Window; +import org.apache.spark.sql.functions; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +// the snapshot ids input are ignored when the start/end timestamps are provided +ProcedureParameter.optional("start_snapshot_id_exclusive", DataTypes.LongType), +ProcedureParameter.optional("end_snapshot_id_inclusive", DataTypes.LongType), +ProcedureParameter.optional("table_change_view", DataTypes.StringType), +ProcedureParameter.optional("identifier_columns", DataTypes.StringType), +ProcedureParameter.optional("start_timestamp", DataTypes.TimestampType), +ProcedureParameter.optional("end_timestamp", DataTypes.TimestampType), Review Comment: Changed to `options` in the procedure. Will add the timestamp range in another PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037489027 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import static org.apache.iceberg.ChangelogOperation.DELETE; +import static org.apache.iceberg.ChangelogOperation.INSERT; + +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.relocated.com.google.common.collect.Iterators; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.api.java.function.MapPartitionsFunction; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.encoders.RowEncoder; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +// the snapshot ids input are ignored when the start/end timestamps are provided +ProcedureParameter.optional("start_snapshot_id_exclusive", DataTypes.LongType), +ProcedureParameter.optional("end_snapshot_id_inclusive", DataTypes.LongType), +ProcedureParameter.optional("table_change_view", DataTypes.StringType), 
+ProcedureParameter.optional("identifier_columns", DataTypes.StringType), +ProcedureParameter.optional("start_timestamp", DataTypes.TimestampType), +ProcedureParameter.optional("end_timestamp", DataTypes.TimestampType), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + private GenerateChangesProcedure(TableCatalog tableCatalog) { +super(tableCatalog); + } + + @Override + public ProcedureParameter[] parameters() { +return PARAMETERS; + } + + @Override + public StructType outputType() { +return OUTPUT_TYPE; + } + + @Override + public InternalRow[] call(InternalRow args) { +String tableName = args.getString(0); + +// Read data from the table.changes +Dataset df = changelogRecords(tableName, args); + +// Compute the pre-image and post-images if the identifier columns are provided. +if (!args.isNullAt(4)) { + String[] identifierColumns = args.getString(4).split(","); + if (identifierColumns.length > 0) { +df = withUpdate(df, identifierColumns); + } +} + +String viewName = viewName(args, tableName); + +// Create a view for users to query +df.creat
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037489658 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.relocated.com.google.common.collect.Iterators; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.api.java.function.MapPartitionsFunction; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.encoders.RowEncoder; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { Review Comment: Made the change per suggestion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi merged PR #6309: URL: https://github.com/apache/iceberg/pull/6309 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on PR #6309: URL: https://github.com/apache/iceberg/pull/6309#issuecomment-1334290414 Thanks, @RussellSpitzer! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko opened a new pull request, #6342: Python: Introduce SchemaVisitorPerPrimitiveType
Fokko opened a new pull request, #6342: URL: https://github.com/apache/iceberg/pull/6342 Instead of having another visitor go over the primitives, I think it is nicer to have an extended schema visitor that also goes over the primitive types -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi opened a new pull request, #6343: Spark 3.3: Remove unused RowDataRewriter
aokolnychyi opened a new pull request, #6343: URL: https://github.com/apache/iceberg/pull/6343 This PR removes the no-longer-used `RowDataRewriter`. This class was needed for the initial data compaction implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037552480 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java: ## @@ -53,6 +53,7 @@ private static Map> initProcedureBuilders() { mapBuilder.put("ancestors_of", AncestorsOfProcedure::builder); mapBuilder.put("register_table", RegisterTableProcedure::builder); mapBuilder.put("publish_changes", PublishChangesProcedure::builder); +mapBuilder.put("generate_changes", GenerateChangesProcedure::builder); Review Comment: Other options would be `register_change_view` or `create_changelog_view`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037554727 ## api/src/main/java/org/apache/iceberg/DataFile.java: ## @@ -102,7 +102,8 @@ public interface DataFile extends ContentFile { int PARTITION_ID = 102; String PARTITION_NAME = "partition"; String PARTITION_DOC = "Partition data tuple, schema based on the partition spec"; - // NEXT ID TO ASSIGN: 142 + + int NEXT_ID_TO_ASSIGN = 142; Review Comment: Changed to find the highest existing ID in the files table schema, and then assign new IDs from there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
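A sketch of that approach; it assumes `TypeUtil.indexById` as the way to collect all assigned field IDs, so treat the helper choice as an assumption rather than the exact code in the PR:

```java
import java.util.Collections;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;

class NextFieldIdSketch {
  // Find the highest field id already used anywhere in the files table schema
  // and hand out the readable-metrics field ids starting right after it.
  static int nextFieldId(Schema filesTableSchema) {
    return Collections.max(TypeUtil.indexById(filesTableSchema.asStruct()).keySet()) + 1;
  }
}
```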
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037554935 ## spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java: ## @@ -817,4 +824,93 @@ public static Set reachableManifestPaths(Table table) { .map(ManifestFile::path) .collect(Collectors.toSet()); } + + public static GenericData.Record asMetadataRecordWithMetrics( + Table dataTable, GenericData.Record file) { +return asMetadataRecordWithMetrics(dataTable, file, FileContent.DATA); + } + + public static GenericData.Record asMetadataRecordWithMetrics( + Table dataTable, GenericData.Record file, FileContent content) { + +Table filesTable = +MetadataTableUtils.createMetadataTableInstance(dataTable, MetadataTableType.FILES); + +GenericData.Record record = +new GenericData.Record(AvroSchemaUtil.convert(filesTable.schema(), "dummy")); +boolean isPartitioned = Partitioning.partitionType(dataTable).fields().size() != 0; +int filesFields = isPartitioned ? 17 : 16; +for (int i = 0; i < filesFields; i++) { + if (i == 0) { +record.put(0, content.id()); + } else if (i == 3) { +record.put(3, 0); // spec id + } else { +record.put(i, file.get(i)); + } +} +record.put( +isPartitioned ? 17 : 16, +expectedReadableMetrics( Review Comment: Did the select to simplify the tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6343: Spark 3.3: Remove unused RowDataRewriter
aokolnychyi merged PR #6343: URL: https://github.com/apache/iceberg/pull/6343
[GitHub] [iceberg] abmo-x closed pull request #6301: Update Schema - should check if field is optional/required
abmo-x closed pull request #6301: Update Schema - should check if field is optional/required URL: https://github.com/apache/iceberg/pull/6301
[GitHub] [iceberg] rbalamohan commented on issue #6326: estimateStatistics costs much time to compute stats
rbalamohan commented on issue #6326: URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1334526348 Check if increasing `iceberg.worker.num-threads` helps in this case. The default is the number of processors available on the system. It can be increased by setting it as a system property (try passing it via the Spark driver/executor JVM options).
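A minimal sketch of wiring this up, assuming the property is read once at JVM startup; the pool size value `32` is arbitrary:

```java
// Set the property before the driver/executor JVMs start, e.g.:
//   spark-submit \
//     --conf "spark.driver.extraJavaOptions=-Diceberg.worker.num-threads=32" \
//     --conf "spark.executor.extraJavaOptions=-Diceberg.worker.num-threads=32" ...
public class WorkerThreadsCheck {
  public static void main(String[] args) {
    // Verify the property actually reached this JVM; null means the
    // default (the number of available processors) is in effect.
    System.out.println("iceberg.worker.num-threads = "
        + System.getProperty("iceberg.worker.num-threads"));
    System.out.println("availableProcessors = "
        + Runtime.getRuntime().availableProcessors());
  }
}
```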
[GitHub] [iceberg] github-actions[bot] commented on issue #3705: Explore spark struct streaming write iceberg and synchronize to hive Metastore
github-actions[bot] commented on issue #3705: URL: https://github.com/apache/iceberg/issues/3705#issuecomment-1334604152 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
[GitHub] [iceberg] github-actions[bot] closed issue #3705: Explore spark struct streaming write iceberg and synchronize to hive Metastore
github-actions[bot] closed issue #3705: Explore spark struct streaming write iceberg and synchronize to hive Metastore URL: https://github.com/apache/iceberg/issues/3705
[GitHub] [iceberg] stevenzwu commented on pull request #5967: Flink: Support read options in flink source
stevenzwu commented on PR #5967: URL: https://github.com/apache/iceberg/pull/5967#issuecomment-1334609528 @hililiwei it seems that the only remaining task is to add a prefix like `connector.iceberg.` to the configs coming from the Flink job configuration. Everything else has been clarified and aligned.
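A minimal sketch of the prefixing idea, assuming options are pulled from Flink's `Configuration`; the helper class and its names are illustrative, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.configuration.Configuration;

public class IcebergOptionPrefix {
  private static final String PREFIX = "connector.iceberg.";

  // Keep only the keys carrying the prefix and strip it before handing
  // the options to the Iceberg source, so unrelated Flink job settings
  // are not accidentally interpreted as read options.
  static Map<String, String> icebergOptions(Configuration flinkConf) {
    Map<String, String> options = new HashMap<>();
    for (Map.Entry<String, String> entry : flinkConf.toMap().entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        options.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
    return options;
  }
}
```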
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037681151 ## core/src/main/java/org/apache/iceberg/MetricsUtil.java: ## @@ -56,4 +72,270 @@ public static MetricsModes.MetricsMode metricsMode( String columnName = inputSchema.findColumnName(fieldId); return metricsConfig.columnMode(columnName); } + + public static final List READABLE_COL_METRICS = + ImmutableList.of( + new ReadableMetricCol("column_size", f -> Types.LongType.get(), "Total size on disk"), + new ReadableMetricCol( + "value_count", f -> Types.LongType.get(), "Total count, including null and NaN"), + new ReadableMetricCol("null_value_count", f -> Types.LongType.get(), "Null value count"), + new ReadableMetricCol("nan_value_count", f -> Types.LongType.get(), "NaN value count"), + new ReadableMetricCol("lower_bound", Types.NestedField::type, "Lower bound"), + new ReadableMetricCol("upper_bound", Types.NestedField::type, "Upper bound")); + + public static final String READABLE_METRICS = "readable_metrics"; + + public static class ReadableMetricCol { +private final String name; +private final Function typeFunction; +private final String doc; + +ReadableMetricCol(String name, Function typeFunction, String doc) { + this.name = name; + this.typeFunction = typeFunction; + this.doc = doc; +} + +String name() { + return name; +} + +Type type(Types.NestedField field) { + return typeFunction.apply(field); +} + +String doc() { + return doc; +} + } + + /** + * Represents a struct of metrics for a primitive column + * + * @param primitive column type + */ + public static class ReadableColMetricsStruct implements StructLike { + +private final String columnName; +private final Long columnSize; +private final Long valueCount; +private final Long nullValueCount; +private final Long nanValueCount; +private final T lowerBound; +private final T upperBound; +private final Map projectionMap; + +public ReadableColMetricsStruct( +String columnName, +Long columnSize, +Long valueCount, +Long nullValueCount, +Long nanValueCount, +T lowerBound, +T upperBound, +Types.NestedField projection) { + this.columnName = columnName; + this.columnSize = columnSize; + this.valueCount = valueCount; + this.nullValueCount = nullValueCount; + this.nanValueCount = nanValueCount; + this.lowerBound = lowerBound; + this.upperBound = upperBound; + this.projectionMap = readableMetricsProjection(projection); +} + +@Override +public int size() { + return projectionMap.size(); +} + +@Override +public T get(int pos, Class javaClass) { + Object value = get(pos); + return value == null ? null : javaClass.cast(value); +} + +@Override +public void set(int pos, T value) { + throw new UnsupportedOperationException("ReadableMetricsStruct is read only"); +} + +private Object get(int pos) { + int projectedPos = projectionMap.get(pos); + switch (projectedPos) { +case 0: + return columnSize; +case 1: + return valueCount; +case 2: + return nullValueCount; +case 3: + return nanValueCount; +case 4: + return lowerBound; +case 5: + return upperBound; +default: + throw new IllegalArgumentException( + String.format("Invalid projected pos %d", projectedPos)); + } +} + +/** @return map of projected position to actual position of this struct's fields */ +private Map readableMetricsProjection(Types.NestedField projection) { + Map result = Maps.newHashMap(); + + Set projectedFields = + Sets.newHashSet( + projection.type().asStructType().fields().stream() + .map(Types.NestedField::name) + .collect(Collectors.toSet())); + + int projectedIndex = 0; + for (int fieldIndex = 0; fieldIndex < READABLE_COL_METRICS.size(); fieldIndex++) { +ReadableMetricCol readableMetric = READABLE_COL_METRICS.get(fieldIndex); + +if (projectedFields.contains(readableMetric.name())) { + result.put(projectedIndex, fieldIndex); + projectedIndex++; +} + } + return result; +} + +String columnName() { + return columnName; +} + } + + /** + * Represents a struct, consisting of all {@link ReadableColMetricsStruct} for all primitive + * columns of the table + */ + public static class ReadableMetricsStruct implements StructLike { + +private final List columnMetrics; + +public ReadableMetricsStruct(List columnMetrics) { + this.columnMetrics = columnMetrics; +} + +@Override
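Once this lands, the new column should be queryable alongside the other files metadata columns. A hedged usage sketch with placeholder catalog/table names:

```java
import org.apache.spark.sql.SparkSession;

public class ReadableMetricsQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("readable-metrics-example")
        .getOrCreate();

    // Each row should carry one struct per primitive column with
    // column_size, value_count, null_value_count, nan_value_count,
    // lower_bound, and upper_bound, per the definitions in the diff above.
    spark.sql("SELECT file_path, readable_metrics FROM db.tbl.files").show(false);
  }
}
```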
[GitHub] [iceberg] aokolnychyi opened a new pull request, #6345: Spark 3.3: Choose readers based on task types
aokolnychyi opened a new pull request, #6345: URL: https://github.com/apache/iceberg/pull/6345 This PR adds `SparkPartitionReaderFactory` that creates readers based on tasks in input partitions.
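A minimal sketch of the general shape of such a factory against Spark's DSv2 interfaces; everything except the Spark interfaces is an illustrative stand-in, not the PR's actual code:

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

class TaskAwareReaderFactory implements PartitionReaderFactory {

  // Stand-in for an input partition that knows which kind of task group it carries.
  interface TaskTypedPartition extends InputPartition {
    boolean isChangelogTask();
    PartitionReader<InternalRow> newRowReader();
    PartitionReader<InternalRow> newChangelogReader();
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    TaskTypedPartition p = (TaskTypedPartition) partition;
    // Dispatch on the task type instead of hard-wiring one reader per scan.
    return p.isChangelogTask() ? p.newChangelogReader() : p.newRowReader();
  }
}
```

Centralizing the dispatch in one factory means each scan no longer needs its own reader-factory subclass.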
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6345: Spark 3.3: Choose readers based on task types
aokolnychyi commented on code in PR #6345: URL: https://github.com/apache/iceberg/pull/6345#discussion_r1037687769 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java: ## @@ -28,21 +28,48 @@ import org.apache.iceberg.io.CloseableIterator; import org.apache.iceberg.io.InputFile; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.spark.source.metrics.TaskNumDeletes; +import org.apache.iceberg.spark.source.metrics.TaskNumSplits; import org.apache.spark.rdd.InputFileBlockHolder; +import org.apache.spark.sql.connector.metric.CustomTaskMetric; +import org.apache.spark.sql.connector.read.PartitionReader; import org.apache.spark.sql.vectorized.ColumnarBatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -class BatchDataReader extends BaseBatchReader { +class BatchDataReader extends BaseBatchReader Review Comment: This class was only used as `PartitionReader` in `SparkScan`, where we extended it, implemented `PartitionReader`, and called the implementation `BatchReader`. After adding a common reader factory, we may have multiple batch readers now. That's why `BatchDataReader` seemed like a more accurate name than `BatchReader`. As there were no other places that used this class, I decided to implement `PartitionReader` directly here. Any feedback is welcome. See `SparkScan` below for the old usage.
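For context, a hedged sketch of the pattern the comment describes, where the reader class itself implements `PartitionReader` and reports task metrics; all concrete names and values here are stand-ins:

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.connector.metric.CustomTaskMetric;
import org.apache.spark.sql.connector.read.PartitionReader;

class RowReaderSketch implements PartitionReader<InternalRow> {
  private static final long TOTAL_ROWS = 3; // pretend task-group size
  private long rowsRead = 0;

  @Override
  public boolean next() {
    return rowsRead < TOTAL_ROWS; // advance until the task group is exhausted
  }

  @Override
  public InternalRow get() {
    rowsRead++;
    return new GenericInternalRow(new Object[] {rowsRead});
  }

  @Override
  public CustomTaskMetric[] currentMetricsValues() {
    // Report a single illustrative task metric, similar in spirit to the
    // TaskNumSplits/TaskNumDeletes imports in the diff above.
    CustomTaskMetric numSplits = new CustomTaskMetric() {
      @Override
      public String name() {
        return "numSplits";
      }

      @Override
      public long value() {
        return 1;
      }
    };
    return new CustomTaskMetric[] {numSplits};
  }

  @Override
  public void close() {}
}
```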
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6345: Spark 3.3: Choose readers based on task types
aokolnychyi commented on code in PR #6345: URL: https://github.com/apache/iceberg/pull/6345#discussion_r1037688715 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java: ## @@ -28,21 +28,48 @@ import org.apache.iceberg.io.CloseableIterator; import org.apache.iceberg.io.InputFile; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.spark.source.metrics.TaskNumDeletes; +import org.apache.iceberg.spark.source.metrics.TaskNumSplits; import org.apache.spark.rdd.InputFileBlockHolder; +import org.apache.spark.sql.connector.metric.CustomTaskMetric; +import org.apache.spark.sql.connector.read.PartitionReader; import org.apache.spark.sql.vectorized.ColumnarBatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -class BatchDataReader extends BaseBatchReader { +class BatchDataReader extends BaseBatchReader +implements PartitionReader { + private static final Logger LOG = LoggerFactory.getLogger(BatchDataReader.class); + private final long numSplits; + + BatchDataReader(SparkInputPartition partition, int batchSize) { +this( +partition.table(), +partition.taskGroup(), +partition.expectedSchema(), +partition.isCaseSensitive(), +batchSize); + } + BatchDataReader( - ScanTaskGroup task, Review Comment: Most other readers use a different parameter order: table, taskGroup, expectedSchema, caseSensitive.