Il-Pela commented on issue #9388:
URL: https://github.com/apache/iceberg/issues/9388#issuecomment-1880654793

   Hi @amogh-jahagirdar, thanks for your clarifications, really helpful!
   
   > I'd need to inspect the code
   
   The code is as simple as it can be (I'm just testing Iceberg's functionalities). I have a mock DataFrame to which I added the _time_ column, used for partitioning, like this:
   `df_with_timestamp = df_basic.withColumn('time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))`
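   
   For context, the end-to-end flow looks roughly like this (the mock data, timestamp value, and table name are simplified placeholders; the write uses the DataFrameWriterV2 API with a daily partition transform, which matches the _time_day=..._ layout I see on GCS):
   
   ```python
   from pyspark.sql.functions import days, lit, unix_timestamp
   
   # Placeholder mock data and timestamp, just to keep the sketch self-contained.
   timestamp = '2023-12-27 12:00:00'
   df_basic = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
   
   df_with_timestamp = df_basic.withColumn(
       'time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast('timestamp')
   )
   
   # Day-partitioned Iceberg table: Iceberg names the partition field `time_day`,
   # which is what produces the time_day=2023-12-27 key prefix in the bucket.
   df_with_timestamp.writeTo('local.db.table_name').partitionedBy(days('time')).createOrReplace()
   ```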
   
   > When dealing with object stores (such as S3, GCS etc.) it's important to remember that there really are no "folders" as such, since these are not file systems in the traditional sense. "/" is used as a delimiter in paths since it mostly helps humans with a sense of hierarchy. You can think of these as objects associated with a key (like a kv store) where the keys have delimiters like "/".
   
   Yeah, you're right, I hadn't taken this behavior of object stores into consideration, my bad.
   In the last few days I gave _expire_snapshots_ a try:
   `spark.sql(f"CALL local.system.expire_snapshots(table => 'table_name', retain_last => 4, older_than => TIMESTAMP '{timestamp_4_days_ago}')")`
   
   with this Spark configuration:
   ```python
   from pyspark.sql import SparkSession
   
   spark = (
       SparkSession.builder
       .appName("MyApp")
       .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.local.type", "hadoop")
       .config("spark.sql.catalog.local.warehouse", MY_WAREHOUSE_ON_GCS)
       .config("spark.sql.defaultCatalog", "local")
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
       .getOrCreate()
   )
   ```
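   
   As a sanity check, the snapshots that remain after the call can be listed through Iceberg's snapshots metadata table (table name is a placeholder, as above):
   
   ```python
   # Shows which snapshots survived expiration, oldest first.
   spark.sql("""
       SELECT committed_at, snapshot_id, operation
       FROM local.db.table_name.snapshots
       ORDER BY committed_at
   """).show(truncate=False)
   ```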
   
   And it worked as expected, deleting the snapshots older than 4 days. However, the reference to the "emptied" folder remained.
   For instance, the folder for _time_day=2023-12-27_ remained, so I checked its contents:
   - from the GCP UI there are no rows to display;
   - from the Cloud Shell interface I see 2 hidden empty objects, each with size 0 bytes and a creation timestamp equal to the execution of the expire_snapshots procedure that deleted the files in that day's folder (see the sketch below).
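   
   For reference, a minimal sketch of how the objects under that prefix can be listed programmatically with the google-cloud-storage client (bucket name and prefix are made up):
   
   ```python
   from google.cloud import storage
   
   # Hypothetical bucket/prefix standing in for the warehouse path on GCS.
   client = storage.Client()
   prefix = 'warehouse/db/table_name/data/time_day=2023-12-27/'
   for blob in client.list_blobs('my-warehouse-bucket', prefix=prefix):
       # The two suspicious entries show up here as 0-byte objects.
       print(blob.name, blob.size, blob.time_created)
   ```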
   
   Could they be some sort of "leftovers" left behind by the expire_snapshots procedure? Or maybe some kind of "soft deletion" that my organization has in place on GCS?
   If you have any suggestions they would be really appreciated; in the meanwhile I'll dig deeper into it.

