goktugkose commented on issue #13763:
URL: https://github.com/apache/iceberg/issues/13763#issuecomment-3175870315

   We have encountered the same issue when rewriting data files. Similar to 
@hguercan's approach, we wrote the Spark SQL query below to check whether the 
same data file is referenced by more than one snapshot. We have also noticed 
that pausing the Sink Connector does not stop all activity in the sink tasks, 
since those tasks still produce INFO-level logs. It seems that 
@kumarpritam863 addressed both items in #13756. 
   
   Thanks for your effort 🚀 
   
   ```
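   # CATALOG, DATASET and TABLE_NAME are placeholders for your own identifiers.
   # Lists data files whose manifest entries reference more than one snapshot.
   df = spark.sql(
       f"""
       SELECT
           data_file.file_path AS file_path,
           COUNT(DISTINCT snapshot_id) AS distinct_snapshots
       FROM `{CATALOG}`.`{DATASET}`.`{TABLE_NAME}`.entries
       GROUP BY data_file.file_path
       HAVING COUNT(DISTINCT snapshot_id) > 1
       """
   )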
   ```
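
To sanity-check the logic without a Spark session, the same GROUP BY / HAVING check can be sketched in plain Python. This is only an illustration, assuming the `entries` rows have been exported as `(file_path, snapshot_id)` pairs. The function name, the sample paths and the snapshot IDs are hypothetical and not part of Iceberg or of the original query.

```python
from collections import defaultdict


def files_shared_across_snapshots(entries):
    """Return {file_path: distinct snapshot count} for files seen in more than one snapshot.

    `entries` is an iterable of (file_path, snapshot_id) pairs (hypothetical export format).
    """
    snapshots_by_file = defaultdict(set)
    for file_path, snapshot_id in entries:
        # A set keeps only distinct snapshot IDs, like COUNT(DISTINCT snapshot_id).
        snapshots_by_file[file_path].add(snapshot_id)
    # Equivalent of HAVING COUNT(DISTINCT snapshot_id) > 1.
    return {
        path: len(ids)
        for path, ids in snapshots_by_file.items()
        if len(ids) > 1
    }


# Hypothetical sample rows: a.parquet appears under two snapshots, b.parquet under one.
sample_entries = [
    ("s3://bucket/t/data/a.parquet", 101),
    ("s3://bucket/t/data/a.parquet", 102),
    ("s3://bucket/t/data/b.parquet", 101),
    ("s3://bucket/t/data/b.parquet", 101),
]

shared = files_shared_across_snapshots(sample_entries)
print(shared)
```

An empty result means no data file in the input is attributed to more than one snapshot. A non-empty result lists the candidates worth inspecting further.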


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
