goktugkose commented on issue #13763:
URL: https://github.com/apache/iceberg/issues/13763#issuecomment-3175870315
We encountered the same issue when rewriting data files. Similar to
@hguercan's approach, we wrote a Spark SQL query to check whether different
snapshots reference the same data file. We also noticed that pausing the Sink
Connector does not stop all processing in the sink tasks: those tasks still
produce INFO-level logs while the connector is paused. It seems that
@kumarpritam863 addressed both items in #13756.
Thanks for your effort 🚀
```
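# Assumes an active SparkSession `spark` and that CATALOG, DATASET and
# TABLE_NAME are already defined as strings.
# Lists data files whose manifest entries carry more than one snapshot_id.
df = spark.sql(
    f"""
    SELECT
        data_file.file_path AS file_path,
        COUNT(DISTINCT snapshot_id) AS distinct_snapshots
    FROM `{CATALOG}`.`{DATASET}`.`{TABLE_NAME}`.entries
    GROUP BY data_file.file_path
    HAVING COUNT(DISTINCT snapshot_id) > 1
    """
)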
```
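For anyone who wants to sanity-check the grouping logic without a Spark session, here is a minimal plain-Python sketch of the same check. It is only an illustration: the function name `files_shared_across_snapshots` and the sample rows are hypothetical, and it assumes the entries have already been reduced to `(snapshot_id, file_path)` pairs.

```python
from collections import defaultdict


def files_shared_across_snapshots(entries):
    """Return {file_path: sorted snapshot ids} for paths seen under more than one snapshot.

    `entries` is an iterable of (snapshot_id, file_path) pairs.
    """
    snapshots_by_file = defaultdict(set)
    for snapshot_id, file_path in entries:
        # A set collapses repeated (snapshot, file) pairs, like COUNT(DISTINCT ...).
        snapshots_by_file[file_path].add(snapshot_id)
    # Keep only files referenced by at least two distinct snapshots (the HAVING clause).
    return {
        path: sorted(ids)
        for path, ids in snapshots_by_file.items()
        if len(ids) > 1
    }


# Hypothetical sample rows, not taken from a real table.
sample_entries = [
    (101, "s3://bucket/data/a.parquet"),
    (102, "s3://bucket/data/a.parquet"),
    (102, "s3://bucket/data/b.parquet"),
    (102, "s3://bucket/data/b.parquet"),
]
shared = files_shared_across_snapshots(sample_entries)
print(shared)  # {'s3://bucket/data/a.parquet': [101, 102]}
```

Unlike the SQL query, this returns the snapshot ids themselves rather than just a count, which makes it easier to see which snapshots overlap.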
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]