Il-Pela commented on issue #9388: URL: https://github.com/apache/iceberg/issues/9388#issuecomment-1880654793
Hi @amogh-jahagirdar, thanks for your clarifications, really helpful!

> I'd need to inspect the code

The code is as simple as it can be (I'm just testing Iceberg's functionality). I have a mock DataFrame to which I added the _time_ column, used for partitioning, like this:

`df_with_timestamp = df_basic.withColumn('time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))`

> When dealing with object stores (such as S3, GCS etc) it's important to remember, that there really are no "folders" as such since these are not file systems in the traditional sense. "/" is used as a delimiter in paths since it mostly helps humans with a sense of hierarchy. You can think of these as objects associated with a key (like a kv store) where the keys have delimiters like "/".

You're right, I hadn't taken this behavior of object stores into consideration, my bad.

In the last few days I gave _expire_snapshots_ a try:

`spark.sql(f"CALL local.system.expire_snapshots(table => 'table_name', retain_last => 4, older_than => TIMESTAMP '{timestamp_4_days_ago}')")`

with this Spark configuration:

```python
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", MY_WAREHOUSE_ON_GCS) \
    .config("spark.sql.defaultCatalog", "local") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()
```

It worked as expected, deleting the snapshots older than 4 days. However, the reference to the "emptied" folder remained. For instance, the folder for _time_day=2023-12-27_ is still there, so I checked its contents:

- from the GCP UI there are no rows to display
- from the Cloud Shell interface I see 2 hidden empty objects, each 0 bytes in size, with a creation timestamp equal to the execution of the expire_snapshots procedure that deleted the files in that day's folder.

Could they be some sort of "leftovers" left behind by the expire_snapshots procedure? Or maybe some kind of "soft deletion" that my organization has in place on GCS? Any suggestions would be really appreciated; in the meantime I'll dig deeper into it.
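For completeness, here's a minimal, self-contained sketch of the whole flow (session setup, the partitioned write, and the corrected `expire_snapshots` call). The bucket path, both timestamps, and the `db.demo` table name are placeholders I made up for the example, and the Iceberg Spark runtime jar still has to be supplied separately (e.g. via `spark-submit --packages`; the exact artifact depends on your Spark/Scala versions):

```python
# Minimal sketch of the flow above, assuming a GCS warehouse and a Hadoop
# catalog. All names below (bucket, timestamps, db.demo) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, unix_timestamp, days

MY_WAREHOUSE_ON_GCS = "gs://my-bucket/warehouse"  # placeholder warehouse path
timestamp = "2023-12-27 10:00:00"                 # placeholder row timestamp
timestamp_4_days_ago = "2023-12-30 00:00:00"      # placeholder expiry cutoff

spark = (
    SparkSession.builder
    .appName("MyApp")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", MY_WAREHOUSE_ON_GCS)
    .config("spark.sql.defaultCatalog", "local")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Mock DataFrame with a `time` column used for day partitioning.
df_basic = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df_with_timestamp = df_basic.withColumn(
    "time",
    unix_timestamp(lit(timestamp), "yyyy-MM-dd HH:mm:ss").cast("timestamp"),
)

# Write to an Iceberg table partitioned by days(time); this produces one
# `time_day=YYYY-MM-DD` prefix per distinct day under the table location.
df_with_timestamp.writeTo("local.db.demo").partitionedBy(days("time")).createOrReplace()

# Expire snapshots older than the cutoff, always retaining the last 4.
# This deletes data/metadata files no longer referenced by a retained
# snapshot; the "folder" prefixes themselves are not files Iceberg deletes,
# which may be why empty prefixes can linger on an object store.
spark.sql(
    f"CALL local.system.expire_snapshots("
    f"table => 'db.demo', "
    f"retain_last => 4, "
    f"older_than => TIMESTAMP '{timestamp_4_days_ago}')"
).show(truncate=False)
```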