[I] [Java] Enhancements for Java Dataset API [arrow-java]

via GitHub Tue, 26 Nov 2024 11:16:41 -0800


chitralverma opened a new issue, #154:
URL: https://github.com/apache/arrow-java/issues/154

### Describe the enhancement requested

The list below suggests changes to improve the developer experience with the
Java Dataset API of Arrow. Most of them, if implemented, would also bring it
in line with the [pyarrow dataset
API](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset).

1. Support for providing filesystem options such as access_key
   programmatically. Currently only environment variables are supported.
2. Support for globbed paths and directories.
3. Excluding invalid files.
4. Additional documentation for already implemented functionality:
   1. Reading from and writing to remote/cloud stores (HDFS, S3, GCS, ...).
   2. Clarification of the behaviour when reading multiple files. When 2 or
      more files are supplied, they may have different schemas. Currently,
      only the schema of the last file is shown by `.inspect()`, and this is
      not documented anywhere. This behaviour is the same in pyarrow. It may
      be a good idea to let users provide a strategy such as Error, Merge, or
      LastFile.
   3. Reading and writing partitioned datasets.
   4. The difference between `FileSystemDatasetFactory.inspect()` and
      `FileSystemDatasetFactory.finish().newScan(...).schema()`. Which one
      should be used in which case?
   5. The environment variables for the filesystem are not documented.
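
To make the schema strategy idea above (Error, Merge, LastFile) more concrete, here is a minimal sketch of what the three options could mean. This is only an illustration and not an existing Arrow API: the `Strategy` enum and the `resolve` method are hypothetical names, and each schema is modeled as a plain ordered map of field name to type name instead of Arrow's `Schema` class.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaStrategySketch {

    /** Hypothetical strategies named after the suggestion; not part of Arrow. */
    enum Strategy { ERROR, MERGE, LAST_FILE }

    /**
     * Resolves a single schema from per-file schemas. Each schema is an ordered
     * map of field name to type name, standing in for Arrow's Schema (assumption).
     */
    static Map<String, String> resolve(List<Map<String, String>> schemas, Strategy strategy) {
        if (schemas.isEmpty()) {
            throw new IllegalArgumentException("no schemas supplied");
        }
        switch (strategy) {
            case LAST_FILE:
                // Behaviour described in the issue: the last file wins.
                return new LinkedHashMap<>(schemas.get(schemas.size() - 1));
            case ERROR: {
                // Fail as soon as any file disagrees with the first one.
                // Map.equals ignores field order, so order differences pass here.
                Map<String, String> first = schemas.get(0);
                for (Map<String, String> s : schemas) {
                    if (!s.equals(first)) {
                        throw new IllegalStateException("schema mismatch: " + first + " vs " + s);
                    }
                }
                return new LinkedHashMap<>(first);
            }
            case MERGE: {
                // Union of all fields in first-seen order; same name with a
                // different type is treated as a conflict.
                Map<String, String> merged = new LinkedHashMap<>();
                for (Map<String, String> s : schemas) {
                    for (Map.Entry<String, String> f : s.entrySet()) {
                        String previous = merged.putIfAbsent(f.getKey(), f.getValue());
                        if (previous != null && !previous.equals(f.getValue())) {
                            throw new IllegalStateException("conflicting types for field " + f.getKey());
                        }
                    }
                }
                return merged;
            }
            default:
                throw new AssertionError(strategy);
        }
    }

    public static void main(String[] args) {
        Map<String, String> a = new LinkedHashMap<>();
        a.put("id", "int64");
        a.put("name", "utf8");
        Map<String, String> b = new LinkedHashMap<>();
        b.put("id", "int64");
        b.put("score", "float64");

        System.out.println(resolve(List.of(a, b), Strategy.LAST_FILE)); // {id=int64, score=float64}
        System.out.println(resolve(List.of(a, b), Strategy.MERGE));     // {id=int64, name=utf8, score=float64}
    }
}
```

With the two example files, `ERROR` would throw because the schemas differ. How a real implementation should treat field order, nullability, or type promotion is left open here.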

Please let me know if the above makes sense; I can help with PRs for these.

### Component(s)

Java

