AlexisBRENON commented on PR #49768:
URL: https://github.com/apache/airflow/pull/49768#issuecomment-3173591967

   > I don’t think checking the creation time is necessary. The list of objects 
to be copied is determined by `S3ListOperator.execute`, so as long as you use 
the same prefix used in the `S3ToGCSOperator`, you should get the same result 
across subsequent tasks
   
   I'm sorry, I don't get your point.
   I have a source bucket where a client puts files every day (all files at the 
root of the bucket).
   I need to copy these files to a GCS bucket, following the same 
(non-)hierarchy.
   
   So on day 1, there is `file1` in the source bucket, I copy it to the 
destination bucket, and so I want the operator to return `[file1]` (this is the 
new file to process today).
   
   On day 2, the client pushes `file2`, so the source bucket contains both 
`file1` and `file2`. I use this operator to sync the buckets, hence copying 
`file2` to the destination bucket. The only new file to process today is 
`file2`, and I want to be able to know that. That's why I expect this operator 
to return the list of copied files.
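   
   To make the day 1 / day 2 scenario concrete, here is a minimal sketch of the behaviour I expect. It is plain Python with in-memory lists standing in for the buckets; `sync_new_files` is a hypothetical helper for illustration, not the operator's actual code.

```python
def sync_new_files(source_keys, destination_keys):
    """Return the keys present in the source but missing from the destination.

    Hypothetical stand-in for "copy what is missing and report what was copied".
    """
    existing = set(destination_keys)
    return [key for key in source_keys if key not in existing]


# In-memory stand-in for the destination bucket (an assumption for this sketch).
destination = []

# Day 1: the source bucket only contains file1.
copied_day1 = sync_new_files(["file1"], destination)
destination.extend(copied_day1)

# Day 2: the client added file2; file1 is already in the destination.
copied_day2 = sync_new_files(["file1", "file2"], destination)
destination.extend(copied_day2)
```

   Here `copied_day2` only contains `file2`, which is the list I would like the operator to return.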
   
   If I use any of the `(S3|Gcs)ListOperator` it will return the list of all 
the files in the bucket. Filtering on the creation time might make it possible 
to get only the files copied by a specific DAG run.
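   
   As a rough sketch of what that filtering could look like, assuming each listed object in the destination bucket comes with a creation timestamp and that the DAG run maps to a known time window: the `(name, created)` tuples, the dates, and `created_during_run` below are all made up for illustration.

```python
from datetime import datetime, timezone


def created_during_run(objects, interval_start, interval_end):
    """Keep the names of objects created in [interval_start, interval_end)."""
    return [
        name
        for name, created in objects
        if interval_start <= created < interval_end
    ]


# Illustrative listing of the destination bucket: (name, creation time).
destination_objects = [
    ("file1", datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)),
    ("file2", datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc)),
]

# Assumed time window of the day 2 DAG run.
day2_start = datetime(2025, 1, 2, tzinfo=timezone.utc)
day2_end = datetime(2025, 1, 3, tzinfo=timezone.utc)

new_today = created_during_run(destination_objects, day2_start, day2_end)
```

   This only works if the window matches when the copy actually happened, which is why I would rather have the operator return the list directly.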
   
   > It's not restarted during deferral, but it’s designed to be stateless and 
resilient to restarts. To preserve that statelessness with your proposed 
solution, you'd need to serialize the list of objects—which might not be ideal, 
as it could consume significant space in the metadata database.
   
   I understand that storing a long list of copied files may not be ideal 
(however, in the non-deferred setup, the list of copied files is stored anyway 
as the XCom value).

