ssandona commented on issue #9923: URL: https://github.com/apache/iceberg/issues/9923#issuecomment-1987928914
Here is a quick snippet to reproduce the error (pyspark):

```python
from pyspark.sql.functions import col

ICEBERG_DB_NAME = "mydb"
ICEBERG_TABLE_NAME_MOR = "my_mor_table"

# Define the number of columns
num_columns = 1010

# Create column names
column_names = [f"col{i}" for i in range(1, num_columns + 1)]

# Create 5 rows
data = [
    tuple([1] + [1] * (num_columns - 1)),
    tuple([1] + [2] * (num_columns - 1)),
    tuple([1] + [3] * (num_columns - 1)),
    tuple([1] + [4] * (num_columns - 1)),
    tuple([1] + [5] * (num_columns - 1)),
]

# Create a DataFrame with the specified rows
df_with_row = spark.createDataFrame(data, column_names)
df_with_row.createOrReplaceTempView("table_input")

spark.sql(f"""
CREATE TABLE {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}
USING iceberg
PARTITIONED BY (col1)
TBLPROPERTIES (
    'format-version'='2',
    'write.delete.mode'='merge-on-read',
    'write.update.mode'='merge-on-read',
    'write.merge.mode'='merge-on-read',
    'write.distribution-mode'='hash',
    'write.delete.distribution-mode'='hash',
    'write.update.distribution-mode'='hash',
    'write.merge.distribution-mode'='hash'
)
AS SELECT * FROM table_input
""")

spark.sql(f"""
MERGE INTO {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR} t
USING (SELECT * FROM table_input WHERE col2 = 1) s
ON t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")

spark.sql(f"""
MERGE INTO {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR} t
USING (SELECT * FROM table_input WHERE col2 = 2) s
ON t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
```

This fails:

```python
spark.sql(f"""
CALL system.rewrite_position_delete_files(
    table => '{ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}',
    options => map('rewrite-all', 'true')
)
""")
```

Error:

```
An error was encountered:
Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
```

This also fails:

```python
spark.sql(f"""
UPDATE {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}
SET col2 = 6
WHERE col2 = 1
""")
```

Error:

```
An error was encountered:
Multiple entries with same key: 1000=_partition.col1 and 1000=col1000
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Multiple entries with same key: 1000=_partition.col1 and 1000=col1000
```
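As a note on the likely root cause (my assumption, not confirmed above): Iceberg's `PartitionSpec` starts partition field IDs at 1000 (`PARTITION_DATA_ID_START`), while data columns are assigned field IDs 1, 2, ... in schema order. With 1010 columns, `col1000` gets field ID 1000, colliding with the first partition field. A minimal sketch of the clash:

```python
# Sketch of the suspected field-ID collision (assumptions: data columns get
# IDs 1..N in schema order; partition fields start at 1000, mirroring
# PARTITION_DATA_ID_START in Iceberg's PartitionSpec).

NUM_COLUMNS = 1010

# Field IDs for the data columns col1..col1010
data_field_ids = {i: f"col{i}" for i in range(1, NUM_COLUMNS + 1)}

# Partition spec over col1: the first partition field gets ID 1000
PARTITION_DATA_ID_START = 1000
partition_field_ids = {PARTITION_DATA_ID_START: "partition.col1"}

# Any shared key means two distinct fields claim the same field ID,
# which is what the "Multiple entries with same key" error reports.
collisions = sorted(set(data_field_ids) & set(partition_field_ids))
for fid in collisions:
    print(f"Multiple entries with same key: {fid}={partition_field_ids[fid]} "
          f"and {fid}=row.{data_field_ids[fid]}")
# → Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
```

If this is right, any table with 1000 or more data columns plus a partition spec would hit the same collision, which would explain why `num_columns = 1010` reproduces it.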