AlexisBRENON commented on PR #49768: URL: https://github.com/apache/airflow/pull/49768#issuecomment-3173591967
> I don’t think checking the creation time is necessary. The list of objects to be copied is determined by `S3ListOperator.execute`, so as long as you use the same prefix used in the `S3ToGCSOperator`, you should get the same result across subsequent tasks

I'm sorry, I don't get your point.
I have a source bucket where a client puts files every day (all files at the root of the bucket). I need to copy these files to a GCS bucket, following the same (non-)hierarchy.
So on day 1, there is `file1` in the source bucket. I copy it to the destination bucket, and I want the operator to return `[file1]` (this is the new file to process today).
On day 2, the client has pushed `file2`, so the source bucket contains both `file1` and `file2`. I use this operator to sync the buckets, which copies `file2` to the destination bucket. The only new file to process today is `file2`, and I want to be able to know that. That's why I expect this operator to return the list of copied files.
If I use any of the `(S3|Gcs)ListOperator`, it will return the list of all the files in the bucket. Filtering on the creation time may make it possible to get the files copied by a specific DAG run.

> It's not restarted during deferral, but it’s designed to be stateless and resilient to restarts. To preserve that statelessness with your proposed solution, you'd need to serialize the list of objects, which might not be ideal, as it could consume significant space in the metadata database.

I understand that storing a long list of copied files may not be ideal (however, in the non-deferred setup, the list of copied files is stored anyway as the XCom value).
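To make the day 1 / day 2 scenario concrete, here is a minimal sketch in plain Python. It is not the operator's actual implementation: `newly_copied` and `created_since` are hypothetical helper names, and the listings are hard-coded instead of coming from S3 or GCS. The first helper models "return only what was copied during this run"; the second models the creation-time filtering workaround.

```python
from datetime import datetime, timezone


def newly_copied(source_keys, dest_keys):
    """Hypothetical helper: keys in the source that are not yet in the
    destination, i.e. the files a sync run would actually copy."""
    existing = set(dest_keys)
    return [key for key in source_keys if key not in existing]


def created_since(objects, since):
    """Hypothetical helper for the workaround: given a mapping of
    key -> creation time, keep the keys created at or after `since`."""
    return sorted(key for key, created in objects.items() if created >= since)


# Day 1: only file1 exists, the destination is empty.
day1 = newly_copied(["file1"], [])

# Day 2: the client pushed file2; file1 was already copied on day 1.
day2 = newly_copied(["file1", "file2"], ["file1"])

# Workaround: filter the destination listing by the start of the day 2 run.
dest_listing = {
    "file1": datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc),
    "file2": datetime(2025, 1, 2, 8, 0, tzinfo=timezone.utc),
}
run_start = datetime(2025, 1, 2, 0, 0, tzinfo=timezone.utc)
recent = created_since(dest_listing, run_start)
```

With these inputs, `day1` holds only `file1`, while `day2` and `recent` hold only `file2`. A list operator on either bucket would return both files on day 2, which is the gap described above.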
