[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036804341

## data/src/test/java/org/apache/iceberg/data/orc/TestOrcRowIterator.java:

@@ -74,7 +75,7 @@ public void writeFile() throws IOException {
           .schema(DATA_SCHEMA)
           // write in such a way that the file contains 2 stripes each with 4 row groups of 1000
           // rows
-          .set("iceberg.orc.vectorbatch.size", "1000")
+          .set(TableProperties.ORC_WRITE_BATCH_SIZE, "1000")

Review Comment:
`iceberg.orc.vectorbatch.size` was deprecated for quite some time. I've restored all code that removes deprecated properties and the code that deals with legacy properties.
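For readers wanting to try the replacement key, here is a minimal sketch of how it plugs into an Iceberg ORC writer; the schema, output path, and generic writer function are illustrative assumptions, not taken from the test above:

```java
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.orc.GenericOrcWriter;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.orc.ORC;
import org.apache.iceberg.types.Types;

public class OrcBatchSizeExample {
  public static void main(String[] args) throws Exception {
    // hypothetical one-column schema, just to make the builder call complete
    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));

    // TableProperties.ORC_WRITE_BATCH_SIZE is the supported replacement for the
    // removed "iceberg.orc.vectorbatch.size" key
    try (FileAppender<Record> writer =
        ORC.write(Files.localOutput("/tmp/example.orc"))
            .schema(schema)
            .createWriterFunc(GenericOrcWriter::buildWriter)
            .set(TableProperties.ORC_WRITE_BATCH_SIZE, "1000")
            .build()) {
      // records would be appended here via writer.add(...)
    }
  }
}
```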
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036806161

## spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala:

@@ -228,13 +225,6 @@ case class RewriteMergeInto(spark: SparkSession) extends Rule[LogicalPlan] with
     }
   }

-  private def isCardinalityCheckEnabled(table: Table): Boolean = {
-    PropertyUtil.propertyAsBoolean(
-      table.properties(),
-      MERGE_CARDINALITY_CHECK_ENABLED,

Review Comment:
I've restored this and updated the comment on the property itself to mention that we will drop it once Spark 3.1 support is dropped.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036809785

## core/src/main/java/org/apache/iceberg/deletes/Deletes.java:

@@ -83,21 +83,6 @@ public static CloseableIterable markDeleted(
         });
   }

-  /**
-   * Returns the remaining rows (the ones that are not deleted), while counting the deleted ones.
-   *
-   * @param rows the rows to process
-   * @param isDeleted a predicate that determines if a row is deleted
-   * @return the processed rows
-   * @deprecated Will be removed in 1.2.0, use {@link Deletes#filterDeleted(CloseableIterable,
-   *     Predicate, DeleteCounter)}.
-   */
-  @Deprecated
-  public static CloseableIterable filterDeleted(

Review Comment:
This functionality was adjusted by #4588. We then re-introduced and deprecated the method in #6146 so as not to have breaking API changes. Also, there's currently no place in the code that uses this, so it seems better to remove it?
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036810161

## core/src/main/java/org/apache/iceberg/TableProperties.java:

@@ -342,19 +335,6 @@ private TableProperties() {}
   public static final String MERGE_MODE = "write.merge.mode";
   public static final String MERGE_MODE_DEFAULT = RowLevelOperationMode.COPY_ON_WRITE.modeName();

-  /**
-   * @deprecated will be removed in 0.14.0, the cardinality check is always performed starting from
-   *     0.13.0.
-   */
-  @Deprecated
-  public static final String MERGE_CARDINALITY_CHECK_ENABLED =
-      "write.merge.cardinality-check.enabled";
-  /**
-   * @deprecated will be removed in 0.14.0, the cardinality check is always performed starting from
-   *     0.13.0.
-   */
-  @Deprecated
-  public static final boolean MERGE_CARDINALITY_CHECK_ENABLED_DEFAULT = true;

Review Comment:
I've restored this property and mentioned that it can only be dropped once Spark 3.1 support is dropped.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036815990

## core/src/main/java/org/apache/iceberg/LocationProviders.java:

@@ -84,10 +84,7 @@ static class DefaultLocationProvider implements LocationProvider {
     private static String dataLocation(Map properties, String tableLocation) {
       String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
       if (dataLocation == null) {
-        dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
-        if (dataLocation == null) {
-          dataLocation = String.format("%s/data", tableLocation);
-        }
+        dataLocation = String.format("%s/data", tableLocation);

Review Comment:
I've restored this.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036815756

## core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:

@@ -758,20 +758,6 @@ protected Map summary() {
     return summaryBuilder.build();
   }

-  /**
-   * Apply the update's changes to the base table metadata and return the new manifest list.
-   *
-   * @param base the base table metadata to apply changes to
-   * @return a manifest list for the new snapshot.
-   * @deprecated Will be removed in 1.2.0, use {@link MergingSnapshotProducer#apply(TableMetadata,
-   *     Snapshot)}.
-   */
-  @Deprecated
-  @Override
-  public List apply(TableMetadata base) {

Review Comment:
This method was changed by #4926. We then restored and deprecated the old API in #6146 for one minor release and are removing it here now. Let me know if we want to keep those methods; then we can just drop the `@Deprecated` annotations from them.
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274:
URL: https://github.com/apache/iceberg/pull/6274#discussion_r1036817726

## spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java:

@@ -144,16 +143,16 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
         });
     List rows = Lists.newArrayList(RandomData.generateSpark(schema, 10, 0L));
-    try (FileAppender writer =
-        new ParquetWriteAdapter<>(
-            new NativeSparkWriterBuilder(outputFile)
-                .set("org.apache.spark.sql.parquet.row.attributes", sparkSchema.json())
-                .set("spark.sql.parquet.writeLegacyFormat", "false")
-                .set("spark.sql.parquet.outputTimestampType", "INT96")
-                .set("spark.sql.parquet.fieldId.write.enabled", "true")
-                .build(),
-            MetricsConfig.getDefault())) {
-      writer.addAll(rows);
+    try (ParquetWriter writer =

Review Comment:
This was using the deprecated `ParquetWriteAdapter` class, so I started changing code that uses this class so that we can eventually remove `ParquetWriteAdapter`.
[GitHub] [iceberg] lichaohao opened a new issue, #6330: iceberg : format-version=2 , when the job is running (insert and update), can not execute rewrite small data file ?
lichaohao opened a new issue, #6330:
URL: https://github.com/apache/iceberg/issues/6330

### Query engine

iceberg: 1.0.0, spark: 3.2.0, flink: 1.13.2, catalog: hive-catalog

### Question

Source table: the MySQL CDC table `mysql_cdc_source`. Sink table: the Iceberg table `my_iceberg_sink`, with primary key `id`, `format-version=2`, and `write.upsert.enabled=true`.

Executed SQL (checkpoint interval 1 min):

upsert into my_iceberg_sink select * from mysql_cdc_source;

Note: MySQL receives both insert and update operations. After the job has been running for some time, I want to rewrite the Iceberg data files into bigger ones, so in Spark I execute:

call "hive_prod.system.rewrite_data_files(my_iceberg_sink)"

and get the following exception message:

can not commit, found new position delete for replaced data file: GenericDataFile hdfs://data.parquet

Given all of the above, what should I do so that `call hive_prod.system.rewrite_data_files(my_iceberg_sink)` executes correctly? Thank you for your answer!
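For reference, the documented form of the Spark procedure call takes the table name as a quoted string argument, which the call quoted in the issue does not; a minimal sketch (the `db` qualifier is a placeholder, not from the issue):

```sql
-- table is passed as a string literal; 'db' is a hypothetical database name
CALL hive_prod.system.rewrite_data_files(table => 'db.my_iceberg_sink');
```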
[GitHub] [iceberg] SHuixo commented on issue #6330: iceberg : format-version=2 , when the job is running (insert and update), can not execute rewrite small data file ?
SHuixo commented on issue #6330:
URL: https://github.com/apache/iceberg/issues/6330#issuecomment-1333486031

This is the same situation I ran into in #6104: in the current version, there is no effective way to compact historical data files containing delete operations while data is being written.
[GitHub] [iceberg] ConeyLiu opened a new pull request, #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ConeyLiu opened a new pull request, #6331:
URL: https://github.com/apache/iceberg/pull/6331

This PR just ported #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15.
[GitHub] [iceberg] chenjunjiedada commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
chenjunjiedada commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333509100

What about 1.16?
[GitHub] [iceberg] ajantha-bhat commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ajantha-bhat commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333580498

> What about 1.16?

I think flink-1.16 and spark-3.3 are handled in the original PR itself: https://github.com/apache/iceberg/pull/4627
[GitHub] [iceberg] ajantha-bhat commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ajantha-bhat commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333582482

nit: we could at least split it into one spark and one flink PR.
[GitHub] [iceberg] nastra commented on issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
nastra commented on issue #6318:
URL: https://github.com/apache/iceberg/issues/6318#issuecomment-1333600955

I've looked at the code, and this happened even before #5681. The decompressors are actually being re-used, just not across different Parquet files. So in your case lots of Parquet files are being read, and that is why you're seeing this log output from Hadoop's `CodecPool`.
[GitHub] [iceberg] camper42 closed issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
camper42 closed issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
URL: https://github.com/apache/iceberg/issues/6318
[GitHub] [iceberg] camper42 commented on issue #6318: executor logs ton of `INFO CodecPool: Got brand-new decompressor [.zstd]`
camper42 commented on issue #6318:
URL: https://github.com/apache/iceberg/issues/6318#issuecomment-1333612142

Thanks. So for my scenario, I may need to compact more promptly and/or set the log level of the `CodecPool` to `WARN`.
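For anyone hitting the same log noise: the messages come from the `org.apache.hadoop.io.compress.CodecPool` logger, so a per-logger override silences them. A sketch assuming a Log4j 1.x `log4j.properties` on the executors (adjust to Log4j 2 syntax if your Spark build uses that):

```properties
# CodecPool logs "Got brand-new decompressor" at INFO; only surface warnings and above
log4j.logger.org.apache.hadoop.io.compress.CodecPool=WARN
```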
[GitHub] [iceberg] ConeyLiu commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
ConeyLiu commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333635843

Thanks @chenjunjiedada @ajantha-bhat for the review.

> What about 1.16?

It is already done in #4627.

> nit: we could at least split it into one spark and one flink PR or one PR for each version.

OK, let me split it into two patches.
[GitHub] [iceberg] chenjunjiedada commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2 and Flink 1.14/1.15
chenjunjiedada commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333640055

Well, I thought Flink 1.16 support was added only recently; I never realized the change was already in it.
[GitHub] [iceberg] loleek opened a new issue, #6332: Create tables error when using JDBC catalog and mysql backend
loleek opened a new issue, #6332:
URL: https://github.com/apache/iceberg/issues/6332

### Apache Iceberg version

1.1.0 (latest release)

### Query engine

_No response_

### Please describe the bug 🐞

When I use the JDBC catalog with a MySQL backend in a Flink test case, the create catalog operation returns an error.

> Flink SQL> CREATE CATALOG iceberg_catalog WITH (
>   'type'='iceberg',
>   'catalog-impl'='org.apache.iceberg.jdbc.JdbcCatalog',
>   'catalog-name' = 'iceberg_catalog',
>   'warehouse' = 'file:///path/to/warehouse',
>   'uri' = 'jdbc:mysql://host:port/iceberg_catalog',
>   'jdbc.user' = 'root',
>   'jdbc.password' = 'passwd');

[ERROR] Could not execute SQL statement. Reason: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes.

I think the root cause is here: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java#L179, which uses CATALOG_NAME VARCHAR(255) + NAMESPACE_NAME VARCHAR(255) + NAMESPACE_PROPERTY_KEY VARCHAR(5500) as the primary key. Here is a related issue that fixed the CREATE_CATALOG_TABLE statement: https://github.com/apache/iceberg/pull/2778/files
[GitHub] [iceberg] ConeyLiu opened a new pull request, #6333: Port #4627 to Flink 1.14/1.15
ConeyLiu opened a new pull request, #6333:
URL: https://github.com/apache/iceberg/pull/6333

This PR just ported #4627 to Flink 1.14/1.15.
[GitHub] [iceberg] nastra commented on issue #6332: Create tables error when using JDBC catalog and mysql backend
nastra commented on issue #6332:
URL: https://github.com/apache/iceberg/issues/6332#issuecomment-1333663036

The `max key length limit=767` is actually a limit that is imposed by the underlying database being used (MySQL in this case). A valid approach would be to increase that limit on MySQL. Related issue: https://github.com/apache/iceberg/issues/6236
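As a hedged illustration of where the 767-byte figure comes from: InnoDB caps index key prefixes at 767 bytes with the older `COMPACT`/`REDUNDANT` row formats and at 3072 bytes with `DYNAMIC`/`COMPRESSED` (the MySQL 8.0 default), so checking the server's row-format settings is a reasonable first step; the catalog table name below is assumed, not verified against the reporter's setup:

```sql
-- settings that govern the maximum index key length (names vary by MySQL version)
SHOW VARIABLES LIKE 'innodb_large_prefix';
SHOW VARIABLES LIKE 'innodb_default_row_format';

-- row format of the catalog's namespace-properties table (name assumed)
SHOW TABLE STATUS LIKE 'iceberg_namespace_properties';
```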
[GitHub] [iceberg] nastra closed issue #6273: java.sql.SQLException: Access denied for user 'hive'@'hadoopSlave0' (using password: YES)
nastra closed issue #6273: java.sql.SQLException: Access denied for user 'hive'@'hadoopSlave0' (using password: YES)
URL: https://github.com/apache/iceberg/issues/6273
[GitHub] [iceberg] nastra commented on issue #6273: java.sql.SQLException: Access denied for user 'hive'@'hadoopSlave0' (using password: YES)
nastra commented on issue #6273:
URL: https://github.com/apache/iceberg/issues/6273#issuecomment-1333664922

I'll close this for now. Feel free to re-open if necessary.
[GitHub] [iceberg] nastra commented on issue #4092: Not an iceberg table Error use hive catalog-type
nastra commented on issue #4092:
URL: https://github.com/apache/iceberg/issues/4092#issuecomment-1333668098

Closing this for now. Feel free to re-open if necessary.
[GitHub] [iceberg] nastra closed issue #4092: Not an iceberg table Error use hive catalog-type
nastra closed issue #4092: Not an iceberg table Error use hive catalog-type
URL: https://github.com/apache/iceberg/issues/4092
[GitHub] [iceberg] nastra commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
nastra commented on issue #6236:
URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1333670878

@yuangjiang you can probably take a look at https://dev.mysql.com/doc/refman/8.0/en/innodb-limits.html
[GitHub] [iceberg] nastra commented on issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29
nastra commented on issue #5027:
URL: https://github.com/apache/iceberg/issues/5027#issuecomment-1333676057

@noneback are you able to increase the limit on MySQL? Different databases usually impose different limits, but I'll see if we can lower the namespace properties column from 5500.
[GitHub] [iceberg] Fokko opened a new pull request, #6334: Python: Add warning on projection by name
Fokko opened a new pull request, #6334:
URL: https://github.com/apache/iceberg/pull/6334

```python
➜  python git:(master) ✗ python3
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.catalog import load_catalog
>>> catalog = load_catalog('local')
>>> table = catalog.load_table(('nyc', 'taxis'))
>>> table.scan().to_arrow()
/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/table/__init__.py:340: UserWarning: Projection is currently done by name instead of column ID, this can lead to incorrect results in some cases.
  warnings.warn(
pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us, tz=+00:00]
tpep_dropoff_datetime: timestamp[us, tz=+00:00]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
```
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5967: Flink: Support read options in flink source
hililiwei commented on code in PR #5967: URL: https://github.com/apache/iceberg/pull/5967#discussion_r1037105073 ## docs/flink-getting-started.md: ## @@ -683,7 +683,47 @@ env.execute("Test Iceberg DataStream"); OVERWRITE and UPSERT can't be set together. In UPSERT mode, if the table is partitioned, the partition fields should be included in equality fields. {{< /hint >}} -## Write options +## Options +### Read options + +Flink read options are passed when configuring the FlinkSink, like this: + +``` +IcebergSource.forRowData() +.tableLoader(tableResource.tableLoader()) +.assignerFactory(new SimpleSplitAssignerFactory()) +.streaming(scanContext.isStreaming()) +.streamingStartingStrategy(scanContext.streamingStartingStrategy()) +.startSnapshotTimestamp(scanContext.startSnapshotTimestamp()) +.startSnapshotId(scanContext.startSnapshotId()) +.set("monitor-interval", "10s") +.build() +``` +For Flink SQL, read options can be passed in via SQL hints like this: +``` +SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='10s') */ +... +``` + +| Flink option| Default| Description | +| --- | -- | | +| snapshot-id || For time travel in batch mode. Read data from the specified snapshot-id. | +| case-sensitive | false | Whether the sql is case sensitive| +| as-of-timestamp || For time travel in batch mode. Read data from the most recent snapshot as of the given time in milliseconds. | +| starting-strategy | INCREMENTAL_FROM_LATEST_SNAPSHOT | Starting strategy for streaming execution. TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode. The incremental mode starts from the current snapshot exclusive. INCREMENTAL_FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive. If it is an empty map, all future append snapshots should be discovered. INCREMENTAL_FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive. If it is an empty map, all future append snapshots should be discovered. INCREMENTAL_FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive. INCREMENTAL_FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive. If the timestamp is between two snapshots, it should start from the snapshot after the timestamp. Just for FIP27 Source | +| start-snapshot-timestamp|| Start to read data from the most recent snapshot as of the given time in milliseconds. | +| start-snapshot-id || Start to read data from the specified snapshot-id. | +| end-snapshot-id | The latest snapshot id | Specifies the end snapshot. | +| split-size | Table read.split.target-size | Overrides this table's read.split.target-size| +| split-lookback | Table read.split.planning-lookback | Overrides this table's read.split.planning-lookback | +| split-file-open-cost| Table read.split.open-file-cost| Overrides this table's read.split.open-file-cost | +| streaming | false | Sets whether the current task runs in streaming or batch mode. | Review Comment: Actually, I think this parameter is a little redundant, why don't we use `execute.runtime-mode`? If we're going to make some breaking change, I'm inclined to deprecated it first and then delete it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
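For context on the suggestion above: `execution.runtime-mode` is Flink's own job-level switch between streaming and batch execution, settable in the configuration or in code; a minimal sketch using the standard Flink API, independent of Iceberg:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeExample {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // equivalent to execution.runtime-mode=BATCH in flink-conf.yaml or on the command line
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
  }
}
```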
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5967: Flink: Support read options in flink source
hililiwei commented on code in PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#discussion_r1037109651

## docs/flink-getting-started.md (same hunk as in the previous comment; the thread is on the `streaming` option row):

+| streaming | false | Sets whether the current task runs in streaming or batch mode. |

Review Comment:
> it has real Flink batch-read usage in production, especially before Flink 1.16 version.

Even version 1.16, which we assessed, still seems inadequate for batch; our business people have no incentive to migrate from Spark to it. But I also tend not to change it.
[GitHub] [iceberg] hililiwei commented on a diff in pull request #5967: Flink: Support read options in flink source
hililiwei commented on code in PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#discussion_r1037113844

## docs/flink-getting-started.md (same hunk as above; the thread is on the `max-planning-snapshot-count` row):

+| monitor-interval | 10s | Interval for listening on the generation of new snapshots. |
+| include-column-stats | false | Create a new scan from this that loads the column stats with each data file. Column stats include: value count, null value count, lower bounds, and upper bounds. |
+| max-planning-snapshot-count | Integer.MAX_VALUE | If there are multiple new snapshots, configures the maximum number of snapshots to advance at a time. |

Review Comment:
I'm conflicted here. If we make it smaller, will it take a long time to catch up to the latest snapshot when the data backlog is severe (such as after a restart)?
[GitHub] [iceberg] hililiwei commented on pull request #5967: Flink: Support read options in flink source
hililiwei commented on PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#issuecomment-1333783891

> > Hmm, I understand your concern now. The job-level configuration should have a connector prefix, which makes sense to me. Shall we consider using the same prefix?
>
> Yeah. I was worried about config naming collisions. A common prefix like `connector.iceberg.` should be enough. But it will be a problem to change it for the write configs, which are already released.

I'm in favor of adding the prefix for the new options and leaving the old ones as-is for compatibility.
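To make the naming discussion concrete, here is a sketch of how a prefixed job-level option could coexist with the unprefixed per-query hint; the `connector.iceberg.` prefix was still a proposal at this point, so the first statement is hypothetical:

```sql
-- hypothetical job-level default with the proposed connector prefix (Flink SQL client)
SET 'connector.iceberg.monitor-interval' = '60s';

-- per-query override via SQL hint keeps the short key
SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='60s') */;
```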
[GitHub] [iceberg] robinsinghstudios commented on issue #5977: How to write to a bucket-partitioned table using PySpark?
robinsinghstudios commented on issue #5977:
URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1333788692

Hi, my data is being appended to the partitioned table without registering any UDFs, but it seems to be written as one row per file, which is creating a huge performance impact. Any suggestions or fixes would be really appreciated.
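One table-level knob that often helps with one-row-per-file symptoms is Iceberg's write distribution mode, which clusters incoming rows by partition before writing; a sketch via Spark SQL, with the table name as a placeholder and no guarantee it resolves this particular case:

```sql
-- 'db.tbl' is a placeholder; hash-distributes rows by the partition spec on write
ALTER TABLE db.tbl SET TBLPROPERTIES ('write.distribution-mode' = 'hash');
```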
[GitHub] [iceberg] ConeyLiu opened a new pull request, #6335: Core: Avoid generating a large ManifestFile when committing
ConeyLiu opened a new pull request, #6335:
URL: https://github.com/apache/iceberg/pull/6335

In our production environment, we noticed that manifest files have widely varying sizes, ranging from several KB to more than 100 MB. It seems `MANIFEST_TARGET_SIZE_BYTES` is not honored during the commit phase. In this PR, we avoid generating a manifest file larger than `MANIFEST_TARGET_SIZE_BYTES` for newly added content files, and generate multiple manifest files when the target size is exceeded.
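A minimal sketch of the roll-over idea the PR describes, written against the public manifest-writer API; the helper shape, the supply of output files, and the threshold wiring are assumptions for illustration, not the PR's actual code:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestWriter;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

class RollingManifestSketch {
  // hypothetical helper: spreads newly added files over several manifests,
  // starting a fresh writer once the current one passes the target size
  static List<ManifestFile> writeRolling(
      PartitionSpec spec, Iterable<DataFile> added, Iterator<OutputFile> outputs, long targetSize)
      throws IOException {
    List<ManifestFile> manifests = Lists.newArrayList();
    ManifestWriter<DataFile> writer = ManifestFiles.write(spec, outputs.next());
    for (DataFile file : added) {
      writer.add(file);
      // length() may lag slightly behind the final size while the writer is open
      if (writer.length() >= targetSize) {
        writer.close();
        manifests.add(writer.toManifestFile());
        writer = ManifestFiles.write(spec, outputs.next());
      }
    }
    writer.close();
    manifests.add(writer.toManifestFile());
    return manifests;
  }
}
```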
[GitHub] [iceberg] hililiwei commented on pull request #5967: Flink: Support read options in flink source
hililiwei commented on PR #5967:
URL: https://github.com/apache/iceberg/pull/5967#issuecomment-1333793436

> 2. some of the read configs don't make sense to set in table environment configs, as they are tied to a specific source/table. How should we handle this situation?
>
> ```
> public static final ConfigOption SNAPSHOT_ID =
>     ConfigOptions.key("snapshot-id").longType().defaultValue(null);
>
> public static final ConfigOption START_SNAPSHOT_ID =
>     ConfigOptions.key("start-snapshot-id").longType().defaultValue(null);
>
> public static final ConfigOption END_SNAPSHOT_ID =
>     ConfigOptions.key("end-snapshot-id").longType().defaultValue(null);
> ```

They will not be set in the table environment. For example:

```
public Long startSnapshotId() {
  return confParser.longConf().option(FlinkReadOptions.START_SNAPSHOT_ID.key()).parseOptional();
}
```

This only works at the job level.
[GitHub] [iceberg] ConeyLiu commented on pull request #6335: Core: Avoid generating a large ManifestFile when committing
ConeyLiu commented on PR #6335:
URL: https://github.com/apache/iceberg/pull/6335#issuecomment-1333794619

Hi @szehon-ho @rdblue @pvary @stevenzwu @chenjunjiedada, please help review this when you are free. Thanks a lot.
[GitHub] [iceberg] ConeyLiu commented on pull request #6331: Port #4627 to Spark 2.4/3.1/3.2
ConeyLiu commented on PR #6331:
URL: https://github.com/apache/iceberg/pull/6331#issuecomment-1333796211

cc @szehon-ho @pvary @kbendick, who reviewed the original PR.
[GitHub] [iceberg] ConeyLiu commented on pull request #6333: Port #4627 to Flink 1.14/1.15
ConeyLiu commented on PR #6333:
URL: https://github.com/apache/iceberg/pull/6333#issuecomment-1333796294

cc @szehon-ho @pvary @kbendick, who reviewed the original PR.
[GitHub] [iceberg] hililiwei opened a new pull request, #6336: Doc: Replace build with append in the Flink sink doc
hililiwei opened a new pull request, #6336:
URL: https://github.com/apache/iceberg/pull/6336

`build()` has been removed.
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222:
URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037191643

## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/StructRowData.java:

@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.data;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.time.Duration;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.OffsetDateTime;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiConsumer;
+import java.util.function.Function;
+import org.apache.flink.table.data.ArrayData;
+import org.apache.flink.table.data.DecimalData;
+import org.apache.flink.table.data.GenericArrayData;
+import org.apache.flink.table.data.GenericMapData;
+import org.apache.flink.table.data.MapData;
+import org.apache.flink.table.data.RawValueData;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.data.StringData;
+import org.apache.flink.table.data.TimestampData;
+import org.apache.flink.types.RowKind;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.flink.FlinkSchemaUtil;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.ByteBuffers;
+
+public class StructRowData implements RowData {
+  private final Types.StructType type;
+  private RowKind kind;
+  private StructLike struct;
+
+  public StructRowData(Types.StructType type) {
+    this(type, RowKind.INSERT);
+  }
+
+  public StructRowData(Types.StructType type, RowKind kind) {
+    this(type, null, kind);
+  }
+
+  private StructRowData(Types.StructType type, StructLike struct) {
+    this(type, struct, RowKind.INSERT);
+  }
+
+  private StructRowData(Types.StructType type, StructLike struct, RowKind kind) {
+    this.type = type;
+    this.struct = struct;
+    this.kind = kind;
+  }
+
+  public StructRowData setStruct(StructLike newStruct) {
+    this.struct = newStruct;
+    return this;
+  }
+
+  @Override
+  public int getArity() {
+    return struct.size();
+  }
+
+  @Override
+  public RowKind getRowKind() {
+    return kind;
+  }
+
+  @Override
+  public void setRowKind(RowKind newKind) {
+    Preconditions.checkNotNull(newKind, "kind can not be null");
+    this.kind = newKind;
+  }
+
+  @Override
+  public boolean isNullAt(int pos) {
+    return struct.get(pos, Object.class) == null;
+  }
+
+  @Override
+  public boolean getBoolean(int pos) {
+    return struct.get(pos, Boolean.class);
+  }
+
+  @Override
+  public byte getByte(int pos) {
+    return (byte) (int) struct.get(pos, Integer.class);

Review Comment:
All types here are derived from https://github.com/apache/iceberg/blob/49a0ea956e3a4b9979754c887d803d4ab51131ae/api/src/main/java/org/apache/iceberg/types/Types.java#L40-L54
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222:
URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037191643

## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/StructRowData.java (same hunk as in the previous comment, ending at `getByte`):

Review Comment:
All types here are derived from https://github.com/apache/iceberg/blob/49a0ea956e3a4b9979754c887d803d4ab51131ae/api/src/main/java/org/apache/iceberg/types/Types.java#L40-L54

It doesn't have `byte` and `short`.
[GitHub] [iceberg] InvisibleProgrammer opened a new pull request, #6337: Update Iceberg Hive documentation
InvisibleProgrammer opened a new pull request, #6337: URL: https://github.com/apache/iceberg/pull/6337 Issue: https://github.com/apache/iceberg/issues/6249 This documentation update covers the new features introduced in Hive 4.0.0-alpha-2. The only exception is https://issues.apache.org/jira/browse/HIVE-26077; it seems it was already supported in alpha-1, so I just refreshed the related part of the documentation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222: URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037252208 ## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/StructRowData.java: ## @@ -0,0 +1,340 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.data; + +import java.math.BigDecimal; +import java.nio.ByteBuffer; +import java.time.Duration; +import java.time.Instant; +import java.time.LocalDate; +import java.time.OffsetDateTime; +import java.util.Collection; +import java.util.List; +import java.util.Map; +import java.util.function.BiConsumer; +import java.util.function.Function; +import org.apache.flink.table.data.ArrayData; +import org.apache.flink.table.data.DecimalData; +import org.apache.flink.table.data.GenericArrayData; +import org.apache.flink.table.data.GenericMapData; +import org.apache.flink.table.data.MapData; +import org.apache.flink.table.data.RawValueData; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.data.StringData; +import org.apache.flink.table.data.TimestampData; +import org.apache.flink.types.RowKind; +import org.apache.iceberg.StructLike; +import org.apache.iceberg.flink.FlinkSchemaUtil; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.types.Type; +import org.apache.iceberg.types.Types; +import org.apache.iceberg.util.ByteBuffers; + +public class StructRowData implements RowData { + private final Types.StructType type; + private RowKind kind; + private StructLike struct; + + public StructRowData(Types.StructType type) { +this(type, RowKind.INSERT); + } + + public StructRowData(Types.StructType type, RowKind kind) { +this(type, null, kind); + } + + private StructRowData(Types.StructType type, StructLike struct) { Review Comment: Yes. It is used to handle the STRUCT/ROW type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
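For illustration, a minimal usage sketch of this wrapper, based on the public constructor and `setStruct` visible in the diff; using the iceberg-data `GenericRecord` as the backing `StructLike` is just one convenient choice, not something the PR prescribes.

```java
import org.apache.flink.types.RowKind;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.flink.data.StructRowData;
import org.apache.iceberg.types.Types;

public class StructRowDataExample {
  public static void main(String[] args) {
    // A one-field struct type and a concrete StructLike value for it.
    Types.StructType structType =
        Types.StructType.of(Types.NestedField.required(1, "id", Types.LongType.get()));
    GenericRecord record = GenericRecord.create(structType);
    record.setField("id", 42L);

    // One StructRowData instance can be reused across rows by swapping the backing struct.
    StructRowData row = new StructRowData(structType, RowKind.INSERT).setStruct(record);
    System.out.println(row.getLong(0)); // prints: 42
  }
}
```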
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6222: Flink: Support inspecting table
hililiwei commented on code in PR #6222: URL: https://github.com/apache/iceberg/pull/6222#discussion_r1037253800 ## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/FlinkCatalog.java: ## @@ -148,6 +150,17 @@ private Namespace toNamespace(String database) { } TableIdentifier toIdentifier(ObjectPath path) { +String objectName = path.getObjectName(); +List tableName = Splitter.on('$').splitToList(objectName); +if (tableName.size() > 1 && MetadataTableType.from(tableName.get(1)) != null) { Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
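A sketch of the resolution this enables, following the `$` convention from the diff (the table name is made up; the diff's check relies on `MetadataTableType.from` returning null for unknown suffixes):

```java
import java.util.List;
import org.apache.iceberg.MetadataTableType;
import org.apache.iceberg.relocated.com.google.common.base.Splitter;

public class MetadataTableNameExample {
  public static void main(String[] args) {
    // "mytable$snapshots" resolves to base table "mytable" plus metadata table "snapshots".
    List<String> parts = Splitter.on('$').splitToList("mytable$snapshots");
    if (parts.size() > 1 && MetadataTableType.from(parts.get(1)) != null) {
      System.out.println("base table:     " + parts.get(0));
      System.out.println("metadata table: " + MetadataTableType.from(parts.get(1)));
    }
  }
}
```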
[GitHub] [iceberg] RussellSpitzer commented on issue #6326: estimateStatistics cost mush time to compute stats
RussellSpitzer commented on issue #6326: URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1333946715 While I don't have a problem with disabling statistics reporting, I am pretty dubious this takes that long. What I believe you are actually seeing is the task list being created for the first time and stored in a list. We use a lazy iterator, which needs to be turned into a list before the job begins (even if statistics are not reported). This means that even if we don't spend the time iterating the list when we are estimating stats, we will spend that same amount of time later when planning tasks. The only difference is that in the current case the second access to "tasks()" is cached, so it's very fast. In this case the speed could probably be improved if the parallelism of the manifest reads were increased. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
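An illustrative sketch of the caching behavior described above, with a memoized supplier standing in for the real `tasks()` implementation (class and method names here are made up, not the actual Spark scan code):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.iceberg.relocated.com.google.common.base.Supplier;
import org.apache.iceberg.relocated.com.google.common.base.Suppliers;

class ScanTasksSketch {
  // The first access pays the full manifest-read cost; later accesses hit the cache.
  private final Supplier<List<String>> tasks = Suppliers.memoize(this::planTasks);

  private List<String> planTasks() {
    // Stands in for materializing the lazy iterable of FileScanTasks into a list.
    return Arrays.asList("task-1", "task-2");
  }

  long estimateStatistics() {
    return tasks.get().size(); // triggers planning on the first call
  }

  List<String> planInputPartitions() {
    return tasks.get(); // cached, so the second access is very fast
  }
}
```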
[GitHub] [iceberg] nastra opened a new pull request, #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra opened a new pull request, #6338: URL: https://github.com/apache/iceberg/pull/6338 Users are running into issues when hooking up the `JdbcCatalog` with `MySQL` or other databases, which impose lower limits than [sqlite](https://www.sqlite.org/limits.html) (which we use for testing the `JdbcCatalog`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on a diff in pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra commented on code in PR #6338: URL: https://github.com/apache/iceberg/pull/6338#discussion_r1037273853 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -185,9 +185,9 @@ final class JdbcUtil { + NAMESPACE_NAME + " VARCHAR(255) NOT NULL," + NAMESPACE_PROPERTY_KEY - + " VARCHAR(5500)," + + " VARCHAR(255)," Review Comment: CATALOG_NAME + NAMESPACE_NAME + NAMESPACE_PROPERTY_KEY make up the primary key, and the primary key size is limited to 767 bytes (255 * 3 = 765 < 767). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
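For reference, a sketch of the DDL shape this produces; the column names follow the constants visible in the diff, while the property-value length is an assumption, so treat this as approximate rather than the exact statement JdbcUtil emits:

```java
class JdbcDdlSketch {
  static final String CREATE_NAMESPACE_PROPERTIES_TABLE =
      "CREATE TABLE iceberg_namespace_properties ("
          + "catalog_name VARCHAR(255) NOT NULL, "
          + "namespace VARCHAR(255) NOT NULL, "
          + "property_key VARCHAR(255), "
          + "property_value VARCHAR(1000), " // length assumed, not from the diff
          // three 255-char key columns: 255 * 3 = 765 < 767
          + "PRIMARY KEY (catalog_name, namespace, property_key))";
}
```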
[GitHub] [iceberg] nastra commented on a diff in pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra commented on code in PR #6338: URL: https://github.com/apache/iceberg/pull/6338#discussion_r1037276526 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -72,9 +72,9 @@ final class JdbcUtil { + TABLE_NAME + " VARCHAR(255) NOT NULL," + METADATA_LOCATION - + " VARCHAR(5500)," + + " VARCHAR(1000)," Review Comment: arbitrarily chosen to be lower than 3072 bytes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
nastra commented on PR #6338: URL: https://github.com/apache/iceberg/pull/6338#issuecomment-1333965778 @openinx given that you reviewed https://github.com/apache/iceberg/pull/2778 back then, could you review this one as well please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #6332: Create tables error when using JDBC catalog and mysql backend
nastra commented on issue #6332: URL: https://github.com/apache/iceberg/issues/6332#issuecomment-1333969634 I have created https://github.com/apache/iceberg/pull/6338 to make this less of an issue with MySQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29
nastra commented on issue #5027: URL: https://github.com/apache/iceberg/issues/5027#issuecomment-1333969840 I have created https://github.com/apache/iceberg/pull/6338 to make this less of an issue with MySQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
nastra commented on issue #6236: URL: https://github.com/apache/iceberg/issues/6236#issuecomment-1333970109 I have created https://github.com/apache/iceberg/pull/6338 to make this less of an issue with MySQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #5977: How to write to a bucket-partitioned table using PySpark?
RussellSpitzer commented on issue #5977: URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1333976156 @robinsinghstudios the data needs to be globally sorted on the bucketing function; at least, that is my guess given those symptoms. Luckily, this should all be less of an issue once the function catalog work in Spark 3.3 is sorted out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
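Until then, one workaround is to register the bucket transform as a UDF and sort on it before writing; here is a sketch assuming Iceberg's `IcebergSpark.registerBucketUDF` helper, with made-up table and column names:

```java
import static org.apache.spark.sql.functions.expr;

import org.apache.iceberg.spark.IcebergSpark;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class BucketedWriteExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Expose Iceberg's bucket transform as a Spark UDF: 16 buckets over a long column.
    IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16);

    // Sort globally on the bucket value so each task writes to few partitions.
    Dataset<Row> df = spark.table("source");
    df.orderBy(expr("iceberg_bucket16(id)"))
        .writeTo("my_catalog.db.bucketed_table")
        .append();
  }
}
```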
[GitHub] [iceberg] aokolnychyi opened a new pull request, #6339: Spark 3.3: Add hours transform function
aokolnychyi opened a new pull request, #6339: URL: https://github.com/apache/iceberg/pull/6339 This PR adds `hours` transform function in Spark 3.3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6339: Spark 3.3: Add hours transform function
aokolnychyi commented on PR #6339: URL: https://github.com/apache/iceberg/pull/6339#issuecomment-1334029975 cc @kbendick @rdblue @RussellSpitzer @szehon-ho @flyrain @gaborkaszab @nastra -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
danielcweeks merged PR #6338: URL: https://github.com/apache/iceberg/pull/6338 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks closed issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29
danielcweeks closed issue #5027: JDBC Catalog create table DDL is not compatible for MYSQL 8.0.29 URL: https://github.com/apache/iceberg/issues/5027 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks closed issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes
danielcweeks closed issue #6236: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 3072 bytes URL: https://github.com/apache/iceberg/issues/6236 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks closed issue #6332: Create tables error when using JDBC catalog and mysql backend
danielcweeks closed issue #6332: Create tables error when using JDBC catalog and mysql backend URL: https://github.com/apache/iceberg/issues/6332 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6322: Core: Fix NPE in CloseableIterable.close()
danielcweeks merged PR #6322: URL: https://github.com/apache/iceberg/pull/6322 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] danielcweeks merged pull request #6317: Core: MetadataUpdateParser should write updates/removals fields rather than updated/removed
danielcweeks merged PR #6317: URL: https://github.com/apache/iceberg/pull/6317 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] alesk opened a new issue, #6340: Deleting column from an iceberg table breaks schema in AWS Glue catalog
alesk opened a new issue, #6340: URL: https://github.com/apache/iceberg/issues/6340 ### Apache Iceberg version 1.1.0 (latest release) ### Query engine Spark ### Please describe the bug 🐞 When deleting a column with Spark SQL using `ALTER TABLE table_name DROP COLUMN column_name`, the following happens: 1) The column gets removed from the current schema in the table's metadata on S3. 2) A new version of the schema is created in AWS Glue, but the column is moved to the last position in the table instead of being removed. The two schemas have diverged. Querying with Spark works as expected, but Athena reports the wrong schema, breaking tools such as Metabase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on issue #6340: Deleting a column from an iceberg table breaks schema in AWS Glue catalog
nastra commented on issue #6340: URL: https://github.com/apache/iceberg/issues/6340#issuecomment-1334099047 @amogh-jahagirdar could you please take a look at this one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6338: Core: Use lower lengths for iceberg_namespace_properties / iceberg_tables in JdbcCatalog
rdblue commented on code in PR #6338: URL: https://github.com/apache/iceberg/pull/6338#discussion_r1037388592 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -72,9 +72,9 @@ final class JdbcUtil { + TABLE_NAME + " VARCHAR(255) NOT NULL," + METADATA_LOCATION - + " VARCHAR(5500)," + + " VARCHAR(1000)," Review Comment: Why not make this as long as we can to avoid truncation? What happens if a value is truncated? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6289: Spark 2.4: Preserve file seq numbers while rewriting manifests
aokolnychyi merged PR #6289: URL: https://github.com/apache/iceberg/pull/6289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6289: Spark 2.4: Preserve file seq numbers while rewriting manifests
aokolnychyi commented on PR #6289: URL: https://github.com/apache/iceberg/pull/6289#issuecomment-1334119345 Thanks, @nastra! Sorry for the delay. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
aokolnychyi commented on code in PR #6274: URL: https://github.com/apache/iceberg/pull/6274#discussion_r1037396694 ## spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java: ## @@ -356,7 +356,7 @@ private static ManifestFile writeManifest( long snapshotId = row.getLong(0); long sequenceNumber = row.getLong(1); Row file = row.getStruct(2); -writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber); +writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber, null); Review Comment: Yep, thanks. I merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6274: Core|ORC|Spark: Remove deprecated functionality
aokolnychyi commented on PR #6274: URL: https://github.com/apache/iceberg/pull/6274#issuecomment-1334121747 I'll take a look at seq number related changes today. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on code in PR #6309: URL: https://github.com/apache/iceberg/pull/6309#discussion_r1037409951 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java: ## @@ -63,7 +64,7 @@ class SparkBatchQueryScan extends SparkScan implements SupportsRuntimeFiltering private static final Logger LOG = LoggerFactory.getLogger(SparkBatchQueryScan.class); - private final TableScan scan; + private final Scan scan; Review Comment: Changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on code in PR #6309: URL: https://github.com/apache/iceberg/pull/6309#discussion_r1037410284 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java: ## @@ -213,40 +215,53 @@ public Scan build() { Schema expectedSchema = schemaWithMetadataColumns(); -TableScan scan = -table -.newScan() -.caseSensitive(caseSensitive) -.filter(filterExpression()) -.project(expectedSchema); +if (startSnapshotId == null) { + BatchScan scan = Review Comment: Went ahead and added separate methods in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on PR #6309: URL: https://github.com/apache/iceberg/pull/6309#issuecomment-1334137970 I tried adding boundaries but it was mostly useless as we need to support arbitrary types. The current approach seems most reasonable at the moment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #6328: Python: Set version to 0.2.0
rdblue merged PR #6328: URL: https://github.com/apache/iceberg/pull/6328 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #6334: Python: Add warning on projection by name
rdblue merged PR #6334: URL: https://github.com/apache/iceberg/pull/6334 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] nastra commented on a diff in pull request #6274: Core|ORC|Spark: Remove deprecated functionality
nastra commented on code in PR #6274: URL: https://github.com/apache/iceberg/pull/6274#discussion_r1037420331 ## spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteManifestsSparkAction.java: ## @@ -356,7 +356,7 @@ private static ManifestFile writeManifest( long snapshotId = row.getLong(0); long sequenceNumber = row.getLong(1); Row file = row.getStruct(2); -writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber); +writer.existing(wrapper.wrap(file), snapshotId, sequenceNumber, null); Review Comment: great, thanks. I'll rebase this PR tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
RussellSpitzer commented on code in PR #6309: URL: https://github.com/apache/iceberg/pull/6309#discussion_r1037424678 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java: ## @@ -211,11 +213,19 @@ public Scan build() { SparkReadOptions.END_SNAPSHOT_ID, SparkReadOptions.START_SNAPSHOT_ID); +if (startSnapshotId != null) { + return buildIncrementalAppendScan(startSnapshotId, endSnapshotId); Review Comment: As a code reviewer, I do think this is cleaner 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on pull request #6314: Core: Re-add and deprecate HMS_TABLE_OWNER to TableProperties
szehon-ho commented on PR #6314: URL: https://github.com/apache/iceberg/pull/6314#issuecomment-1334176715 Thanks a lot for the fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6339: Spark 3.3: Add hours transform function
aokolnychyi merged PR #6339: URL: https://github.com/apache/iceberg/pull/6339 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6339: Spark 3.3: Add hours transform function
aokolnychyi commented on PR #6339: URL: https://github.com/apache/iceberg/pull/6339#issuecomment-1334220935 Thanks, @RussellSpitzer @nastra! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037486854 ## core/src/main/java/org/apache/iceberg/MetricsUtil.java: ## @@ -56,4 +72,270 @@ public static MetricsModes.MetricsMode metricsMode( String columnName = inputSchema.findColumnName(fieldId); return metricsConfig.columnMode(columnName); } + + public static final List READABLE_COL_METRICS = + ImmutableList.of( + new ReadableMetricCol("column_size", f -> Types.LongType.get(), "Total size on disk"), + new ReadableMetricCol( + "value_count", f -> Types.LongType.get(), "Total count, including null and NaN"), + new ReadableMetricCol("null_value_count", f -> Types.LongType.get(), "Null value count"), + new ReadableMetricCol("nan_value_count", f -> Types.LongType.get(), "NaN value count"), + new ReadableMetricCol("lower_bound", Types.NestedField::type, "Lower bound"), + new ReadableMetricCol("upper_bound", Types.NestedField::type, "Upper bound")); + + public static final String READABLE_METRICS = "readable_metrics"; + + public static class ReadableMetricCol { +private final String name; +private final Function typeFunction; +private final String doc; + +ReadableMetricCol(String name, Function typeFunction, String doc) { + this.name = name; + this.typeFunction = typeFunction; + this.doc = doc; +} + +String name() { + return name; +} + +Type type(Types.NestedField field) { + return typeFunction.apply(field); +} + +String doc() { + return doc; +} + } + + /** + * Represents a struct of metrics for a primitive column + * + * @param primitive column type + */ + public static class ReadableColMetricsStruct implements StructLike { + +private final String columnName; +private final Long columnSize; +private final Long valueCount; +private final Long nullValueCount; +private final Long nanValueCount; +private final T lowerBound; +private final T upperBound; +private final Map projectionMap; + +public ReadableColMetricsStruct( +String columnName, +Long columnSize, +Long valueCount, +Long nullValueCount, +Long nanValueCount, +T lowerBound, +T upperBound, +Types.NestedField projection) { + this.columnName = columnName; + this.columnSize = columnSize; + this.valueCount = valueCount; + this.nullValueCount = nullValueCount; + this.nanValueCount = nanValueCount; + this.lowerBound = lowerBound; + this.upperBound = upperBound; + this.projectionMap = readableMetricsProjection(projection); +} + +@Override +public int size() { + return projectionMap.size(); +} + +@Override +public T get(int pos, Class javaClass) { + Object value = get(pos); + return value == null ? 
null : javaClass.cast(value); +} + +@Override +public void set(int pos, T value) { + throw new UnsupportedOperationException("ReadableMetricsStruct is read only"); +} + +private Object get(int pos) { + int projectedPos = projectionMap.get(pos); + switch (projectedPos) { +case 0: + return columnSize; +case 1: + return valueCount; +case 2: + return nullValueCount; +case 3: + return nanValueCount; +case 4: + return lowerBound; +case 5: + return upperBound; +default: + throw new IllegalArgumentException( + String.format("Invalid projected pos %d", projectedPos)); + } +} + +/** @return map of projected position to actual position of this struct's fields */ +private Map readableMetricsProjection(Types.NestedField projection) { + Map result = Maps.newHashMap(); + + Set projectedFields = + Sets.newHashSet( + projection.type().asStructType().fields().stream() + .map(Types.NestedField::name) + .collect(Collectors.toSet())); + + int projectedIndex = 0; + for (int fieldIndex = 0; fieldIndex < READABLE_COL_METRICS.size(); fieldIndex++) { +ReadableMetricCol readableMetric = READABLE_COL_METRICS.get(fieldIndex); + +if (projectedFields.contains(readableMetric.name())) { + result.put(projectedIndex, fieldIndex); + projectedIndex++; +} + } + return result; +} + +String columnName() { + return columnName; +} + } + + /** + * Represents a struct, consisting of all {@link ReadableColMetricsStruct} for all primitive + * columns of the table + */ + public static class ReadableMetricsStruct implements StructLike { + +private final List columnMetrics; + +public ReadableMetricsStruct(List columnMetrics) { + this.columnMetrics = columnMetrics; +} + +@Override
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037488519 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,271 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.util.Arrays; +import java.util.UUID; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.expressions.Window; +import org.apache.spark.sql.functions; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +// the snapshot ids input are ignored when the start/end timestamps are provided +ProcedureParameter.optional("start_snapshot_id_exclusive", DataTypes.LongType), +ProcedureParameter.optional("end_snapshot_id_inclusive", DataTypes.LongType), +ProcedureParameter.optional("table_change_view", DataTypes.StringType), +ProcedureParameter.optional("identifier_columns", DataTypes.StringType), +ProcedureParameter.optional("start_timestamp", DataTypes.TimestampType), +ProcedureParameter.optional("end_timestamp", DataTypes.TimestampType), Review Comment: Changed to `options` in the procedure. Will add the timestamp range in another PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037489027 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import static org.apache.iceberg.ChangelogOperation.DELETE; +import static org.apache.iceberg.ChangelogOperation.INSERT; + +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.relocated.com.google.common.collect.Iterators; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.api.java.function.MapPartitionsFunction; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.encoders.RowEncoder; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +// the snapshot ids input are ignored when the start/end timestamps are provided +ProcedureParameter.optional("start_snapshot_id_exclusive", DataTypes.LongType), +ProcedureParameter.optional("end_snapshot_id_inclusive", DataTypes.LongType), +ProcedureParameter.optional("table_change_view", DataTypes.StringType), 
+ProcedureParameter.optional("identifier_columns", DataTypes.StringType), +ProcedureParameter.optional("start_timestamp", DataTypes.TimestampType), +ProcedureParameter.optional("end_timestamp", DataTypes.TimestampType), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + private GenerateChangesProcedure(TableCatalog tableCatalog) { +super(tableCatalog); + } + + @Override + public ProcedureParameter[] parameters() { +return PARAMETERS; + } + + @Override + public StructType outputType() { +return OUTPUT_TYPE; + } + + @Override + public InternalRow[] call(InternalRow args) { +String tableName = args.getString(0); + +// Read data from the table.changes +Dataset df = changelogRecords(tableName, args); + +// Compute the pre-image and post-images if the identifier columns are provided. +if (!args.isNullAt(4)) { + String[] identifierColumns = args.getString(4).split(","); + if (identifierColumns.length > 0) { +df = withUpdate(df, identifierColumns); + } +} + +String viewName = viewName(args, tableName); + +// Create a view for users to query +df.creat
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037489658 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.relocated.com.google.common.collect.Iterators; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.api.java.function.MapPartitionsFunction; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.encoders.RowEncoder; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { Review Comment: Made the change per suggestion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi merged PR #6309: URL: https://github.com/apache/iceberg/pull/6309 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #6309: Spark 3.3: Consume arbitrary scans in SparkBatchQueryScan
aokolnychyi commented on PR #6309: URL: https://github.com/apache/iceberg/pull/6309#issuecomment-1334290414 Thanks, @RussellSpitzer! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko opened a new pull request, #6342: Python: Introduce SchemaVisitorPerPrimitiveType
Fokko opened a new pull request, #6342: URL: https://github.com/apache/iceberg/pull/6342 Instead of having another visitor go over the primitives, I think it is nicer to have an extended schema visitor that also goes over the primitive types -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi opened a new pull request, #6343: Spark 3.3: Remove unused RowDataRewriter
aokolnychyi opened a new pull request, #6343: URL: https://github.com/apache/iceberg/pull/6343 This PR removes the no-longer-used `RowDataRewriter`. This class was needed for the initial data compaction implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1037552480 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java: ## @@ -53,6 +53,7 @@ private static Map> initProcedureBuilders() { mapBuilder.put("ancestors_of", AncestorsOfProcedure::builder); mapBuilder.put("register_table", RegisterTableProcedure::builder); mapBuilder.put("publish_changes", PublishChangesProcedure::builder); +mapBuilder.put("generate_changes", GenerateChangesProcedure::builder); Review Comment: Other options would be `register_change_view` or `create_changelog_view`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037554727 ## api/src/main/java/org/apache/iceberg/DataFile.java: ## @@ -102,7 +102,8 @@ public interface DataFile extends ContentFile { int PARTITION_ID = 102; String PARTITION_NAME = "partition"; String PARTITION_DOC = "Partition data tuple, schema based on the partition spec"; - // NEXT ID TO ASSIGN: 142 + + int NEXT_ID_TO_ASSIGN = 142; Review Comment: Changed to find the highest existing ID in the files table schema, and then assign new IDs from there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
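A sketch of that approach; it assumes `TypeUtil.indexById` as the way to collect all assigned field IDs, so treat the helper choice as an assumption rather than the exact code in the PR:

```java
import java.util.Collections;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;

class NextFieldIdSketch {
  // Find the highest field id already used anywhere in the files table schema
  // and hand out the readable-metrics field ids starting right after it.
  static int nextFieldId(Schema filesTableSchema) {
    return Collections.max(TypeUtil.indexById(filesTableSchema.asStruct()).keySet()) + 1;
  }
}
```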
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037554935 ## spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java: ## @@ -817,4 +824,93 @@ public static Set reachableManifestPaths(Table table) { .map(ManifestFile::path) .collect(Collectors.toSet()); } + + public static GenericData.Record asMetadataRecordWithMetrics( + Table dataTable, GenericData.Record file) { +return asMetadataRecordWithMetrics(dataTable, file, FileContent.DATA); + } + + public static GenericData.Record asMetadataRecordWithMetrics( + Table dataTable, GenericData.Record file, FileContent content) { + +Table filesTable = +MetadataTableUtils.createMetadataTableInstance(dataTable, MetadataTableType.FILES); + +GenericData.Record record = +new GenericData.Record(AvroSchemaUtil.convert(filesTable.schema(), "dummy")); +boolean isPartitioned = Partitioning.partitionType(dataTable).fields().size() != 0; +int filesFields = isPartitioned ? 17 : 16; +for (int i = 0; i < filesFields; i++) { + if (i == 0) { +record.put(0, content.id()); + } else if (i == 3) { +record.put(3, 0); // spec id + } else { +record.put(i, file.get(i)); + } +} +record.put( +isPartitioned ? 17 : 16, +expectedReadableMetrics( Review Comment: Did the select to simplify the tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #6343: Spark 3.3: Remove unused RowDataRewriter
aokolnychyi merged PR #6343: URL: https://github.com/apache/iceberg/pull/6343
[GitHub] [iceberg] abmo-x closed pull request #6301: Update Schema - should check if field is optional/required
abmo-x closed pull request #6301: Update Schema - should check if field is optional/required URL: https://github.com/apache/iceberg/pull/6301
[GitHub] [iceberg] rbalamohan commented on issue #6326: estimateStatistics costs much time to compute stats
rbalamohan commented on issue #6326: URL: https://github.com/apache/iceberg/issues/6326#issuecomment-1334526348 Check if increasing `iceberg.worker.num-threads` helps in this case. The default is the number of processors available on the system. It can be increased by setting it as a system property (try passing it via the Spark driver/executor JVM options).
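A minimal sketch of wiring this up, assuming the property is read once at JVM startup; the pool size value `32` is arbitrary:

```java
// Set the property before the driver/executor JVMs start, e.g.:
//   spark-submit \
//     --conf "spark.driver.extraJavaOptions=-Diceberg.worker.num-threads=32" \
//     --conf "spark.executor.extraJavaOptions=-Diceberg.worker.num-threads=32" ...
public class WorkerThreadsCheck {
  public static void main(String[] args) {
    // Verify the property actually reached this JVM; null means the
    // default (the number of available processors) is in effect.
    System.out.println("iceberg.worker.num-threads = "
        + System.getProperty("iceberg.worker.num-threads"));
    System.out.println("availableProcessors = "
        + Runtime.getRuntime().availableProcessors());
  }
}
```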
[GitHub] [iceberg] github-actions[bot] commented on issue #3705: Explore spark struct streaming write iceberg and synchronize to hive Metastore
github-actions[bot] commented on issue #3705: URL: https://github.com/apache/iceberg/issues/3705#issuecomment-1334604152 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
[GitHub] [iceberg] github-actions[bot] closed issue #3705: Explore spark struct streaming write iceberg and synchronize to hive Metastore
github-actions[bot] closed issue #3705: Explore spark struct streaming write iceberg and synchronize to hive Metastore URL: https://github.com/apache/iceberg/issues/3705
[GitHub] [iceberg] stevenzwu commented on pull request #5967: Flink: Support read options in flink source
stevenzwu commented on PR #5967: URL: https://github.com/apache/iceberg/pull/5967#issuecomment-1334609528 @hililiwei it seems that the only remaining task is to add a prefix like `connector.iceberg.` to the configs coming from the Flink job configuration. Everything else has been clarified and aligned.
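A minimal sketch of the prefixing idea, assuming options are pulled from Flink's `Configuration`; the helper class and its names are illustrative, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.configuration.Configuration;

public class IcebergOptionPrefix {
  private static final String PREFIX = "connector.iceberg.";

  // Keep only the keys carrying the prefix and strip it before handing
  // the options to the Iceberg source, so unrelated Flink job settings
  // are not accidentally interpreted as read options.
  static Map<String, String> icebergOptions(Configuration flinkConf) {
    Map<String, String> options = new HashMap<>();
    for (Map.Entry<String, String> entry : flinkConf.toMap().entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        options.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
    return options;
  }
}
```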
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5376: Core: Add readable metrics columns to files metadata tables
szehon-ho commented on code in PR #5376: URL: https://github.com/apache/iceberg/pull/5376#discussion_r1037681151 ## core/src/main/java/org/apache/iceberg/MetricsUtil.java: ## @@ -56,4 +72,270 @@ public static MetricsModes.MetricsMode metricsMode( String columnName = inputSchema.findColumnName(fieldId); return metricsConfig.columnMode(columnName); } + + public static final List READABLE_COL_METRICS = + ImmutableList.of( + new ReadableMetricCol("column_size", f -> Types.LongType.get(), "Total size on disk"), + new ReadableMetricCol( + "value_count", f -> Types.LongType.get(), "Total count, including null and NaN"), + new ReadableMetricCol("null_value_count", f -> Types.LongType.get(), "Null value count"), + new ReadableMetricCol("nan_value_count", f -> Types.LongType.get(), "NaN value count"), + new ReadableMetricCol("lower_bound", Types.NestedField::type, "Lower bound"), + new ReadableMetricCol("upper_bound", Types.NestedField::type, "Upper bound")); + + public static final String READABLE_METRICS = "readable_metrics"; + + public static class ReadableMetricCol { +private final String name; +private final Function typeFunction; +private final String doc; + +ReadableMetricCol(String name, Function typeFunction, String doc) { + this.name = name; + this.typeFunction = typeFunction; + this.doc = doc; +} + +String name() { + return name; +} + +Type type(Types.NestedField field) { + return typeFunction.apply(field); +} + +String doc() { + return doc; +} + } + + /** + * Represents a struct of metrics for a primitive column + * + * @param primitive column type + */ + public static class ReadableColMetricsStruct implements StructLike { + +private final String columnName; +private final Long columnSize; +private final Long valueCount; +private final Long nullValueCount; +private final Long nanValueCount; +private final T lowerBound; +private final T upperBound; +private final Map projectionMap; + +public ReadableColMetricsStruct( +String columnName, +Long columnSize, +Long valueCount, +Long nullValueCount, +Long nanValueCount, +T lowerBound, +T upperBound, +Types.NestedField projection) { + this.columnName = columnName; + this.columnSize = columnSize; + this.valueCount = valueCount; + this.nullValueCount = nullValueCount; + this.nanValueCount = nanValueCount; + this.lowerBound = lowerBound; + this.upperBound = upperBound; + this.projectionMap = readableMetricsProjection(projection); +} + +@Override +public int size() { + return projectionMap.size(); +} + +@Override +public T get(int pos, Class javaClass) { + Object value = get(pos); + return value == null ? null : javaClass.cast(value); +} + +@Override +public void set(int pos, T value) { + throw new UnsupportedOperationException("ReadableMetricsStruct is read only"); +} + +private Object get(int pos) { + int projectedPos = projectionMap.get(pos); + switch (projectedPos) { +case 0: + return columnSize; +case 1: + return valueCount; +case 2: + return nullValueCount; +case 3: + return nanValueCount; +case 4: + return lowerBound; +case 5: + return upperBound; +default: + throw new IllegalArgumentException( + String.format("Invalid projected pos %d", projectedPos)); + } +} + +/** @return map of projected position to actual position of this struct's fields */ +private Map readableMetricsProjection(Types.NestedField projection) { + Map result = Maps.newHashMap(); + + Set projectedFields = + Sets.newHashSet( + projection.type().asStructType().fields().stream() + .map(Types.NestedField::name) + .collect(Collectors.toSet())); + + int projectedIndex = 0; + for (int fieldIndex = 0; fieldIndex < READABLE_COL_METRICS.size(); fieldIndex++) { +ReadableMetricCol readableMetric = READABLE_COL_METRICS.get(fieldIndex); + +if (projectedFields.contains(readableMetric.name())) { + result.put(projectedIndex, fieldIndex); + projectedIndex++; +} + } + return result; +} + +String columnName() { + return columnName; +} + } + + /** + * Represents a struct, consisting of all {@link ReadableColMetricsStruct} for all primitive + * columns of the table + */ + public static class ReadableMetricsStruct implements StructLike { + +private final List columnMetrics; + +public ReadableMetricsStruct(List columnMetrics) { + this.columnMetrics = columnMetrics; +} + +@Override
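Once this lands, the new column should be queryable alongside the other files metadata columns. A hedged usage sketch with placeholder catalog/table names:

```java
import org.apache.spark.sql.SparkSession;

public class ReadableMetricsQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("readable-metrics-example")
        .getOrCreate();

    // Each row should carry one struct per primitive column with
    // column_size, value_count, null_value_count, nan_value_count,
    // lower_bound, and upper_bound, per the definitions in the diff above.
    spark.sql("SELECT file_path, readable_metrics FROM db.tbl.files").show(false);
  }
}
```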
[GitHub] [iceberg] aokolnychyi opened a new pull request, #6345: Spark 3.3: Choose readers based on task types
aokolnychyi opened a new pull request, #6345: URL: https://github.com/apache/iceberg/pull/6345 This PR adds `SparkPartitionReaderFactory` that creates readers based on tasks in input partitions.
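A minimal sketch of the general shape of such a factory against Spark's DSv2 interfaces; everything except the Spark interfaces is an illustrative stand-in, not the PR's actual code:

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

class TaskAwareReaderFactory implements PartitionReaderFactory {

  // Stand-in for an input partition that knows which kind of task group it carries.
  interface TaskTypedPartition extends InputPartition {
    boolean isChangelogTask();
    PartitionReader<InternalRow> newRowReader();
    PartitionReader<InternalRow> newChangelogReader();
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    TaskTypedPartition p = (TaskTypedPartition) partition;
    // Dispatch on the task type instead of hard-wiring one reader per scan.
    return p.isChangelogTask() ? p.newChangelogReader() : p.newRowReader();
  }
}
```

Centralizing the dispatch in one factory means each scan no longer needs its own reader-factory subclass.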
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6345: Spark 3.3: Choose readers based on task types
aokolnychyi commented on code in PR #6345: URL: https://github.com/apache/iceberg/pull/6345#discussion_r1037687769 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java: ## @@ -28,21 +28,48 @@ import org.apache.iceberg.io.CloseableIterator; import org.apache.iceberg.io.InputFile; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.spark.source.metrics.TaskNumDeletes; +import org.apache.iceberg.spark.source.metrics.TaskNumSplits; import org.apache.spark.rdd.InputFileBlockHolder; +import org.apache.spark.sql.connector.metric.CustomTaskMetric; +import org.apache.spark.sql.connector.read.PartitionReader; import org.apache.spark.sql.vectorized.ColumnarBatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -class BatchDataReader extends BaseBatchReader { +class BatchDataReader extends BaseBatchReader Review Comment: This class was only used as `PartitionReader` in `SparkScan`, where we extended it, implemented `PartitionReader`, and called the implementation `BatchReader`. After adding a common reader factory, we may have multiple batch readers now. That's why `BatchDataReader` seemed like a more accurate name than `BatchReader`. As there were no other places that used this class, I decided to implement `PartitionReader` directly here. Any feedback is welcome. See `SparkScan` below for the old usage.
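For context, a hedged sketch of the pattern the comment describes, where the reader class itself implements `PartitionReader` and reports task metrics; all concrete names and values here are stand-ins:

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.connector.metric.CustomTaskMetric;
import org.apache.spark.sql.connector.read.PartitionReader;

class RowReaderSketch implements PartitionReader<InternalRow> {
  private static final long TOTAL_ROWS = 3; // pretend task-group size
  private long rowsRead = 0;

  @Override
  public boolean next() {
    return rowsRead < TOTAL_ROWS; // advance until the task group is exhausted
  }

  @Override
  public InternalRow get() {
    rowsRead++;
    return new GenericInternalRow(new Object[] {rowsRead});
  }

  @Override
  public CustomTaskMetric[] currentMetricsValues() {
    // Report a single illustrative task metric, similar in spirit to the
    // TaskNumSplits/TaskNumDeletes imports in the diff above.
    CustomTaskMetric numSplits = new CustomTaskMetric() {
      @Override
      public String name() {
        return "numSplits";
      }

      @Override
      public long value() {
        return 1;
      }
    };
    return new CustomTaskMetric[] {numSplits};
  }

  @Override
  public void close() {}
}
```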
[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6345: Spark 3.3: Choose readers based on task types
aokolnychyi commented on code in PR #6345: URL: https://github.com/apache/iceberg/pull/6345#discussion_r1037688715 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java: ## @@ -28,21 +28,48 @@ import org.apache.iceberg.io.CloseableIterator; import org.apache.iceberg.io.InputFile; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.spark.source.metrics.TaskNumDeletes; +import org.apache.iceberg.spark.source.metrics.TaskNumSplits; import org.apache.spark.rdd.InputFileBlockHolder; +import org.apache.spark.sql.connector.metric.CustomTaskMetric; +import org.apache.spark.sql.connector.read.PartitionReader; import org.apache.spark.sql.vectorized.ColumnarBatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -class BatchDataReader extends BaseBatchReader { +class BatchDataReader extends BaseBatchReader +implements PartitionReader { + private static final Logger LOG = LoggerFactory.getLogger(BatchDataReader.class); + private final long numSplits; + + BatchDataReader(SparkInputPartition partition, int batchSize) { +this( +partition.table(), +partition.taskGroup(), +partition.expectedSchema(), +partition.isCaseSensitive(), +batchSize); + } + BatchDataReader( - ScanTaskGroup task, Review Comment: Most other readers use a different parameter order: table, taskGroup, expectedSchema, caseSensitive.