liapengpony commented on code in PR #2149:
URL: https://github.com/apache/hadoop/pull/2149#discussion_r2250007813


##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -4086,25 +4175,41 @@ public boolean exists(Path f) throws IOException {
   }
 
   /**
-   * Override superclass so as to add statistic collection.
+   * Optimized probe for a path referencing a dir.
+   * Even though it is optimized to a single HEAD, applications
+   * should not over-use this method...it is all too common.
    * {@inheritDoc}
    */
   @Override
   @SuppressWarnings("deprecation")
   public boolean isDirectory(Path f) throws IOException {

Review Comment:
   @steveloughran it looks like the change to this function was meant to 
improve performance, but I am seeing a performance regression after upgrading 
Spark from 3.1.2 to 3.5.1, and I traced it to this change.
   
   When I call spark.read.parquet('s3a://path/to/1.parquet', ..., 
's3a://path/to/10000.parquet') before this change, only HEAD requests are sent 
to build the DataFrame. After this change, LIST requests are sent, which is 
significantly slower because I am reading a large number of Parquet files.
   
   The docstring's claim that "it is optimized to a single HEAD" also confuses 
me, because StatusProbeEnum.DIRECTORIES is just an alias for 
StatusProbeEnum.LIST_ONLY, which suggests the probe issues a LIST rather than 
a HEAD.
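   
   To make the concern concrete, here is a minimal, self-contained sketch of 
how I read the probe sets. It is a simplified model, not the real 
StatusProbeEnum: the `Probe` enum, the `ProbeSketch` class, and the 
request-counting helpers are hypothetical, and I am assuming each path is 
probed exactly once and that HEAD_ONLY contains only the HEAD probe.

```java
import java.util.EnumSet;
import java.util.Set;

public class ProbeSketch {
    /** Hypothetical stand-in for the kinds of S3 request a status probe may issue. */
    enum Probe { HEAD, LIST }

    // Simplified mirrors of the probe sets. The names follow StatusProbeEnum,
    // but this is an illustrative model and not the actual Hadoop class.
    static final Set<Probe> HEAD_ONLY = EnumSet.of(Probe.HEAD);
    static final Set<Probe> LIST_ONLY = EnumSet.of(Probe.LIST);
    // The alias in question: a "directories" probe is the same set as LIST_ONLY.
    static final Set<Probe> DIRECTORIES = LIST_ONLY;

    /** LIST requests issued if each of {@code paths} paths is probed once (assumption). */
    static long listRequests(int paths, Set<Probe> probes) {
        return probes.contains(Probe.LIST) ? paths : 0;
    }

    /** HEAD requests issued if each of {@code paths} paths is probed once (assumption). */
    static long headRequests(int paths, Set<Probe> probes) {
        return probes.contains(Probe.HEAD) ? paths : 0;
    }

    public static void main(String[] args) {
        int paths = 10_000; // same order of magnitude as the read described above
        System.out.println("DIRECTORIES -> HEAD: " + headRequests(paths, DIRECTORIES)
                + ", LIST: " + listRequests(paths, DIRECTORIES));
        System.out.println("HEAD_ONLY   -> HEAD: " + headRequests(paths, HEAD_ONLY)
                + ", LIST: " + listRequests(paths, HEAD_ONLY));
    }
}
```

   Under these assumptions, a DIRECTORIES probe over every input path means 
one LIST per path and no HEAD at all, which matches the request pattern I am 
observing.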
   
   Am I missing anything here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

