chitralverma opened a new issue, #154: URL: https://github.com/apache/arrow-java/issues/154
### Describe the enhancement requested

The changes listed below are suggested to improve the developer experience with the Dataset API of java/arrow. Most of these suggestions, if implemented, would bring it in line with the [pyarrow dataset API](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset).

1. Support for providing Filesystem options like access_key etc. programmatically. Currently only env vars are supported.
2. Support for globbed paths and directories
3. Excluding invalid files
4. Additional documentation for already implemented functionality
   1. Reading and writing to remote/cloud stores (HDFS, S3, GCS ...)
   2. Clarification of behaviour when reading multiple files. When 2 or more files are supplied, they may have different schemas. Currently, only the schema of the last file is shown by `.inspect()` and this is not documented anywhere. This behaviour is the same in pyarrow. Maybe it's a good idea to allow users to provide a strategy like Error, Merge, LastFile etc.
   3. Reading and writing partitioned datasets
   4. Difference between `FileSystemDatasetFactory.inspect()` and `FileSystemDatasetFactory.finish().newScan(...).schema()`. Which one should be used in which case?
   5. Env vars for Filesystem are not documented

Please let me know if the above makes sense; I can help with PRs for the same.

### Component(s)

Java

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org