dramaticlly opened a new issue, #6888:
URL: https://github.com/apache/iceberg/issues/6888

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   
   
   By default, the `add_files` procedure checks for duplicates when importing 
externally written data files into Iceberg tables.
   
   It reads `data_file.file_path` from the entries table and compares it 
against the file paths found in `source_table`, per 
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L532-L541.
 However, a manifest entry has status = 2 once its file is deleted, so the check 
incorrectly treats a deleted file path as a duplicate and prevents the files from being added.
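   
   As a rough illustration of the expected behavior (not the actual 
`SparkTableUtil` code), the duplicate check would only consider entries that 
are not marked deleted. The sketch below is plain Scala with no Spark 
dependency. `ManifestEntry`, `DuplicateCheck`, `livePaths` and `duplicates` are 
hypothetical names made up for this example. The status codes (0 = EXISTING, 
1 = ADDED, 2 = DELETED) follow the Iceberg table spec.
   
   ```scala
   // Hypothetical, simplified model of a row in the entries table.
   final case class ManifestEntry(status: Int, filePath: String)

   object DuplicateCheck {
     // Manifest entry status codes as defined by the Iceberg table spec.
     val Existing = 0
     val Added = 1
     val Deleted = 2

     // Paths of files that are still live, i.e. entries not marked DELETED.
     def livePaths(entries: Seq[ManifestEntry]): Set[String] =
       entries.filter(_.status != Deleted).map(_.filePath).toSet

     // Import candidates that collide with a live file in the table.
     def duplicates(entries: Seq[ManifestEntry], importPaths: Seq[String]): Seq[String] = {
       val live = livePaths(entries)
       importPaths.filter(live.contains)
     }
   }
   ```
   
   With this filter, a path whose only entry has status = 2 is not reported 
as a duplicate, so the repro below would be expected to succeed.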
   
   Repro
   ```scala
   //1. create iceberg table
   val tableId = "iceberg.hongyue_zhang.repro"
   spark.sql(s"""CREATE TABLE if not exists
   $tableId
   ( id bigint, log_dateint bigint, request_dateint bigint )
   USING iceberg
   PARTITIONED BY (log_dateint, request_dateint)""");
   //2. insert some data 
   val insertSQL = s"INSERT INTO TABLE $tableId VALUES (1, 20230220,20230221);"
   spark.sql(insertSQL).show
   //3. delete from iceberg table
   val deleteSQL = s"DELETE FROM $tableId"
   spark.sql(deleteSQL);
   //4. using add files to add them back and run into exception 
   val tableIdWoCatalog = tableId.split("\\.").drop(1).mkString(".")
   val parquetFilePath = "s3a://bucket/warehouse/hongyue_zhang.db/repro/data"
   val addFilesSQL = s"""
   CALL iceberg.system.add_files(
   table =>'$tableIdWoCatalog',
   source_table => '`parquet`.`$parquetFilePath`'
   )
   """.stripMargin
   
   spark.sql(addFilesSQL).show
   
   
   java.lang.IllegalStateException: Cannot complete import because data files 
to be imported already exist within the target table: 
s3a://bucket/warehouse/hongyue_zhang.db/repro/data/log_dateint=20230220/request_dateint=20230211/00196-17-1f414c2a-c7aa-4f22-887e-f7126a68e9a0-00001.parquet.
  This is disabled by default as Iceberg is not designed for mulitple references
    to the same file within the same table.  If you are sure, you may set 
'check_duplicate_files' to false to force the import.
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

