liapengpony commented on code in PR #2149:
URL: https://github.com/apache/hadoop/pull/2149#discussion_r2250007813
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -4086,25 +4175,41 @@ public boolean exists(Path f) throws IOException {
}
/**
- * Override superclass so as to add statistic collection.
+ * Optimized probe for a path referencing a dir.
+ * Even though it is optimized to a single HEAD, applications
+ * should not over-use this method...it is all too common.
* {@inheritDoc}
*/
@Override
@SuppressWarnings("deprecation")
public boolean isDirectory(Path f) throws IOException {
Review Comment:
@steveloughran it looks like the change to this function was meant to improve
performance, but I am seeing a performance regression after upgrading Spark
from 3.1.2 to 3.5.1, and I traced it to this change.
When I call spark.read.parquet('s3a://path/to/1.parquet', ...,
's3a://path/to/10000.parquet') before this change, only HEAD requests are sent
to build the DataFrame. After this change, LIST requests are sent, which is
significantly slower because I am reading a large number of Parquet files.
The docstring "it is optimized to a single HEAD" also confuses me, because
StatusProbeEnum.DIRECTORIES is just an alias for StatusProbeEnum.LIST_ONLY.
Am I missing anything here?
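
To make the question concrete, here is a minimal sketch of how I read the probe selection. It is a simplified model and not the Hadoop source: the class `ProbeSketch`, the enum `Probe`, and the method `requestsFor` are hypothetical names used only for illustration. The one thing taken from the code under review is that `DIRECTORIES` is an alias for `LIST_ONLY`.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Simplified model of probe selection; NOT the Hadoop implementation.
class ProbeSketch {
    // Hypothetical stand-in for the kinds of S3 request a status probe may issue.
    enum Probe { HEAD, LIST }

    static final Set<Probe> HEAD_ONLY = EnumSet.of(Probe.HEAD);
    static final Set<Probe> LIST_ONLY = EnumSet.of(Probe.LIST);
    // As noted in the comment above: DIRECTORIES is an alias for LIST_ONLY.
    static final Set<Probe> DIRECTORIES = LIST_ONLY;

    // Returns the request types a probe set would trigger in this model.
    static List<String> requestsFor(Set<Probe> probes) {
        List<String> requests = new ArrayList<>();
        for (Probe p : Probe.values()) {
            if (probes.contains(p)) {
                requests.add(p.name());
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        // In this model a directory probe never includes a HEAD request.
        System.out.println("isDirectory probes: " + requestsFor(DIRECTORIES));
        System.out.println("file-only probes:   " + requestsFor(HEAD_ONLY));
    }
}
```

If this model matches the real behavior, a directory probe on each of the 10000 input paths would issue a LIST per path and no HEAD, which is what I seem to observe.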
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]