liapengpony commented on code in PR #2149:
URL: https://github.com/apache/hadoop/pull/2149#discussion_r2251142761
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -4086,25 +4175,41 @@ public boolean exists(Path f) throws IOException {
}
/**
- * Override superclass so as to add statistic collection.
+ * Optimized probe for a path referencing a dir.
+ * Even though it is optimized to a single HEAD, applications
+ * should not over-use this method...it is all too common.
* {@inheritDoc}
*/
@Override
@SuppressWarnings("deprecation")
public boolean isDirectory(Path f) throws IOException {
Review Comment:
Thanks for the reply!
> not good. file a PR, including what you can of the stack of checks.
Will do.
> how, why are you providing a list of many, many files, given that spark
expects to be working on a directory at a time?
We have an upstream service that generates many parquet files whose paths are
long and deeply nested. I am responsible for ingesting them into an Iceberg
table with a PySpark cron job. Reading a list of many files instead of a
directory works around the slow, recursive S3 (in our case, Ceph RGW) LIST
calls: I only need to make one LIST call before passing the input to
spark.read.parquet().
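A minimal sketch of that workaround (assuming hypothetical names; `parquet_paths` and the bucket/key values are illustrative, not from the PR). Because S3 keys are a flat namespace, a single LIST with a prefix returns every object under the "tree", and the resulting explicit file list can be handed to `spark.read.parquet(*paths)` instead of a directory:

```python
def parquet_paths(keys, bucket):
    """Filter a flat S3 key listing down to fully-qualified parquet paths.

    `keys` is what one LIST call (e.g. ListObjectsV2 with a prefix) would
    return; non-data objects such as _SUCCESS markers are skipped.
    """
    return [f"s3a://{bucket}/{k}" for k in keys if k.endswith(".parquet")]

# Hypothetical listing result from one flat LIST call:
keys = ["a/b/c/part-0.parquet", "a/b/c/_SUCCESS", "a/d/part-1.parquet"]
paths = parquet_paths(keys, "mybucket")
# spark.read.parquet(*paths)  # pass explicit files, not a directory
```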
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]