chitralverma opened a new issue, #154:
URL: https://github.com/apache/arrow-java/issues/154

   ### Describe the enhancement requested
   
   The changes listed below would improve the developer experience with the 
Dataset API of java/arrow. Most of them, if implemented, would also make it 
consistent with the [pyarrow dataset 
API](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset).
   
   1. Support for providing filesystem options (such as `access_key`) 
programmatically. Currently only environment variables are supported.
   2. Support for globbed paths and directories
   3. Excluding invalid files
   4. Additional documentation for already implemented functionality
       1. Reading and writing to remote/cloud stores (HDFS, S3, GCS, ...)
       2. Clarification of behaviour when reading multiple files. When 2 or more 
files are supplied, they may have different schemas. Currently, only the schema of 
the last file is shown by `.inspect()`, and this is not documented anywhere. 
This behaviour is the same in pyarrow. It may be a good idea to let users 
provide a strategy such as Error, Merge, or LastFile.
       3. Reading and writing partitioned datasets
       4. Difference between `FileSystemDatasetFactory.inspect()` and 
`FileSystemDatasetFactory.finish().newScan(...).schema()`. Which one to use in 
which case?
       5. Env vars for Filesystem are not documented
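
As a rough illustration of the strategy idea in item 4.2, the sketch below models each file's schema as an ordered map of field name to type name and resolves the maps under three hypothetical strategies. `SchemaStrategySketch`, `Strategy`, and `resolve` are names invented for this example; none of them exist in the java/arrow Dataset API, and real Arrow schemas carry more than a name and a type string.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in arrow-java. */
final class SchemaStrategySketch {

  /** Hypothetical strategies for combining per-file schemas. */
  enum Strategy { ERROR, MERGE, LAST_FILE }

  /**
   * Resolves one schema from the per-file schemas, given in file order.
   * A schema is modelled here as a map of field name to type name.
   */
  static Map<String, String> resolve(List<Map<String, String>> schemas, Strategy strategy) {
    if (schemas.isEmpty()) {
      throw new IllegalArgumentException("at least one schema is required");
    }
    switch (strategy) {
      case LAST_FILE:
        // The behaviour reported above for inspect(): the last file wins.
        return new LinkedHashMap<>(schemas.get(schemas.size() - 1));
      case ERROR:
        // Fail unless every file has the same fields and types
        // (Map.equals ignores field order).
        Map<String, String> first = schemas.get(0);
        for (Map<String, String> s : schemas) {
          if (!s.equals(first)) {
            throw new IllegalStateException("schemas differ: " + first + " vs " + s);
          }
        }
        return new LinkedHashMap<>(first);
      case MERGE:
        // Union of all fields; one field with two different types is an error.
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> s : schemas) {
          for (Map.Entry<String, String> f : s.entrySet()) {
            String previous = merged.putIfAbsent(f.getKey(), f.getValue());
            if (previous != null && !previous.equals(f.getValue())) {
              throw new IllegalStateException("conflicting types for field " + f.getKey());
            }
          }
        }
        return merged;
      default:
        throw new AssertionError("unknown strategy: " + strategy);
    }
  }

  public static void main(String[] args) {
    Map<String, String> fileA = new LinkedHashMap<>();
    fileA.put("id", "int64");
    Map<String, String> fileB = new LinkedHashMap<>();
    fileB.put("id", "int64");
    fileB.put("name", "utf8");
    System.out.println(resolve(List.of(fileA, fileB), Strategy.MERGE));
  }
}
```

With something like this, `LAST_FILE` would keep today's behaviour as the default, while `ERROR` and `MERGE` would be opt-in.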
   
   Please let me know if the above makes sense; I can help with PRs for these.
   
   ### Component(s)
   
   Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
