Plenitude-ai opened a new issue, #45321: URL: https://github.com/apache/arrow/issues/45321
### Describe the enhancement requested

In [pyarrow.dataset.write_dataset()](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html), there are three options for the `existing_data_behavior` argument: `'error' | 'overwrite_or_ignore' | 'delete_matching'`. I'd like to have a new one: **`'ignore'`**.

From the description of `'overwrite_or_ignore'`: `... This behavior [...] will allow for an append workflow.`

I really like this concept, and I would find it perfect if it were possible to **not download again** files that have already been downloaded, extending this "append" philosophy in a broader way. When keeping a dataset up to date from another source, this would allow downloading only the new data instead of every data point, and therefore avoid wasting time/bandwidth on data that is already present.

I don't know if this could work, or if this is the right place for such an option/use case.

### Appendix

As an illustration of what I mean, this is what I did to compare two datasets and download only the data that is not already present:

```python
import logging
import os

import fsspec
import pyarrow
import pyarrow.dataset
import tqdm

logger = logging.getLogger(__name__)


def update_local_dataset(
    remote_dataset: pyarrow.dataset.Dataset,
    local_dataset: pyarrow.dataset.Dataset,
):
    # It would also be great if we could reuse the datasets' filesystems,
    # something like dataset.filesystem
    s3_fs: fsspec.AbstractFileSystem = s3_tool.get_s3_fs_from_config()  # project-specific helper
    local_fs: fsspec.AbstractFileSystem = fsspec.filesystem("local")

    logger.info("Updating local clicklog dataset...")

    # Collect the file names present on each side.
    remote_base_dir = os.path.dirname(next(remote_dataset.get_fragments()).path)
    remote_filenames = {os.path.basename(fragment.path) for fragment in remote_dataset.get_fragments()}
    local_base_dir = os.path.dirname(next(local_dataset.get_fragments()).path)
    local_filenames = {os.path.basename(fragment.path) for fragment in local_dataset.get_fragments()}

    # Download only the files that are missing locally.
    to_dl_filenames = remote_filenames.difference(local_filenames)
    for filename in tqdm.tqdm(to_dl_filenames):
        remote_filepath = os.path.join(remote_base_dir, filename)
        local_filepath = os.path.join(local_base_dir, filename)
        s3_fs.get_file(remote_filepath, local_filepath)

    logger.info("Updated local dataset!")
```

### Component(s)

Python
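
As an addendum to the request above, a minimal sketch of what a call using the proposed value could look like. This is hypothetical: the `'ignore'` value does not exist in the current API, and `new_table` and the local path are placeholders.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder table standing in for data freshly fetched from the remote source.
new_table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Hypothetical call: 'ignore' is the proposed value and is not part of the current API.
ds.write_dataset(
    new_table,
    base_dir="local_dataset",
    format="parquet",
    basename_template="part-{i}.parquet",
    existing_data_behavior="ignore",  # proposed: leave already-present files untouched
)
```

Today, the closest workflow is `existing_data_behavior='overwrite_or_ignore'` combined with a unique `basename_template` per write, which appends new files but does not skip data that already exists on the destination.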