Plenitude-ai opened a new issue, #45321:
URL: https://github.com/apache/arrow/issues/45321

   ### Describe the enhancement requested
   
   In [pyarrow.dataset.write_dataset()](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html), there are 3 options for the argument `existing_data_behavior`:
   `‘error’ | ‘overwrite_or_ignore’ | ‘delete_matching’`
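
   For reference, here is a minimal sketch of how this argument is used today (the table and output directory are made up for illustration):
   ```python
   import pyarrow as pa
   import pyarrow.dataset

   table = pa.table({"id": [1, 2, 3]})

   # ‘overwrite_or_ignore’ keeps existing files untouched and only writes
   # new ones, which is what enables the append workflow mentioned below.
   pyarrow.dataset.write_dataset(
       table,
       "my_dataset",  # hypothetical output directory
       format="parquet",
       existing_data_behavior="overwrite_or_ignore",
   )
   ```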
   
   I'd like to have a new one: **`‘ignore’`**
   
   As stated in the description of `‘overwrite_or_ignore’`:
   `... This behavior [...] will allow for an append workflow.`
   I really like this concept, and I would find it perfect if it were possible to **not download again** files that have already been downloaded, hence expanding this "append" philosophy in a broader way.
   When we want to keep a dataset up to date from another source, this would allow downloading only the new data instead of every data point, and therefore avoid wasting time/bandwidth on already-downloaded data.
   I don't know if this could work, or if this is the right place for such an option/use case.
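
   To make it concrete, here is a purely hypothetical sketch of how I imagine the new option being used (`‘ignore’` does not exist today; the paths are made up):
   ```python
   import pyarrow.dataset

   remote_dataset = pyarrow.dataset.dataset("s3://bucket/dataset/")  # hypothetical source

   # Proposed behavior: skip any file that already exists in the destination,
   # transferring only the fragments that are missing locally.
   pyarrow.dataset.write_dataset(
       remote_dataset,
       "local_dataset",  # hypothetical destination
       format="parquet",
       existing_data_behavior="ignore",  # the requested new option
   )
   ```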
   
   ### Appendix
   As an illustration of what I mean, this is what I did to compare 2 datasets and download only the data that is not already present:
   ```python
   import logging
   import os

   import fsspec
   import pyarrow
   import pyarrow.dataset
   import tqdm

   logger = logging.getLogger(__name__)


   def update_local_dataset(
       remote_dataset: pyarrow.dataset.Dataset,
       local_dataset: pyarrow.dataset.Dataset,
   ):
       # It would also be great if we could reuse the datasets' filesystems,
       # something like dataset.filesystem
       # (s3_tool is a project-specific helper returning an fsspec S3 filesystem)
       s3_fs: fsspec.AbstractFileSystem = s3_tool.get_s3_fs_from_config()
       local_fs: fsspec.AbstractFileSystem = fsspec.filesystem("local")
       logger.info("Updating local clicklog dataset...")

       # Base directory and file names on the remote side
       remote_base_dir = os.path.dirname(next(remote_dataset.get_fragments()).path)
       remote_filenames = {
           os.path.basename(fragment.path)
           for fragment in remote_dataset.get_fragments()
       }

       # Same on the local side
       local_base_dir = os.path.dirname(next(local_dataset.get_fragments()).path)
       local_filenames = {
           os.path.basename(fragment.path)
           for fragment in local_dataset.get_fragments()
       }

       # Download only the files that are not already present locally
       to_dl_filenames = remote_filenames.difference(local_filenames)
       for filename in tqdm.tqdm(to_dl_filenames):
           remote_filepath = os.path.join(remote_base_dir, filename)
           local_filepath = os.path.join(local_base_dir, filename)
           s3_fs.get_file(remote_filepath, local_filepath)
       logger.info("Updated local dataset!")
   ```
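
   And a hypothetical way to call it, assuming both sides can be opened with `pyarrow.dataset.dataset()` (the paths and filesystem are made up):
   ```python
   import pyarrow.dataset

   # s3_fs as built by the project-specific helper above
   remote_dataset = pyarrow.dataset.dataset("bucket/clicklogs/", filesystem=s3_fs)
   local_dataset = pyarrow.dataset.dataset("/data/clicklogs/")
   update_local_dataset(remote_dataset, local_dataset)
   ```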
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
