[
https://issues.apache.org/jira/browse/HADOOP-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011843#comment-18011843
]
ASF GitHub Bot commented on HADOOP-13230:
-----------------------------------------
liapengpony commented on code in PR #2149:
URL: https://github.com/apache/hadoop/pull/2149#discussion_r2251142761
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:
##########
@@ -4086,25 +4175,41 @@ public boolean exists(Path f) throws IOException {
}
/**
- * Override superclass so as to add statistic collection.
+ * Optimized probe for a path referencing a dir.
+ * Even though it is optimized to a single HEAD, applications
+ * should not over-use this method...it is all too common.
* {@inheritDoc}
*/
@Override
@SuppressWarnings("deprecation")
public boolean isDirectory(Path f) throws IOException {
Review Comment:
Thanks for the reply!
> not good. file a PR, including what you can of the stack of checks.
Will do.
> how, why are you providing a list of many, many files, given that spark
expects to be working on a directory at a time?
We have an upstream service generating many parquet files whose paths are
long and deeply nested. I am responsible for ingesting them into an Iceberg
table with a PySpark cron job. Reading a list of many files instead of a
directory works around the slow, recursive S3 (in our case, Ceph RGW) LIST
calls: I only need to make one LIST call before passing the input to
spark.read.parquet().
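For illustration, a minimal PySpark sketch of that workaround (the bucket,
prefix, and boto3-based listing are hypothetical, not taken from the PR):

```python
# Minimal sketch of the workaround described above. Bucket and prefix
# names are assumptions; none of this is from the PR itself.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest").getOrCreate()

# One flat, paginated LIST over the prefix. Without a Delimiter,
# ListObjectsV2 returns every key under the prefix, so no per-directory
# recursion is needed.
s3 = boto3.client("s3")
paths = []
for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket="my-bucket", Prefix="upstream/output/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            paths.append("s3a://my-bucket/" + obj["Key"])

# spark.read.parquet() accepts multiple paths, so no directory walk is
# performed on the Hadoop side.
df = spark.read.parquet(*paths)
```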
> S3A to optionally retain directory markers
> ------------------------------------------
>
> Key: HADOOP-13230
> URL: https://issues.apache.org/jira/browse/HADOOP-13230
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Aaron Fabbri
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.3.1
>
> Attachments: 2020-02-Fixing the S3A directory marker problem.pdf
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Users of s3a may not realize that, in some cases, it does not interoperate
> well with other s3 tools, such as the AWS CLI. (See HIVE-13778, IMPALA-3558).
> Specifically, if a user:
> - Creates an empty directory with hadoop fs -mkdir s3a://bucket/path
> - Copies data into that directory via another tool, e.g. the AWS CLI.
> - Tries to access the data in that directory with any Hadoop software.
> Then the last step fails because the fake empty-directory blob that s3a wrote
> in the first step causes s3a (listStatus() etc.) to continue to treat that
> directory as empty, even though the second step was supposed to populate the
> directory with data.
> I wanted to document this fact for users. We may mark this as won't-fix, "by
> design". It may also be interesting to brainstorm solutions and/or a config
> option to change the behavior if folks care.
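For illustration, a hypothetical boto3 sketch of the object layout behind the
three steps in the description (bucket and key names are assumptions, and
exact marker-key naming varies across s3a versions):

```python
# Hypothetical reproduction of the object layout behind the three steps
# above, using boto3 in place of "another tool". All names are assumptions.
import boto3

s3 = boto3.client("s3")

# Step 1: `hadoop fs -mkdir s3a://bucket/path` leaves a zero-byte
# directory-marker object (an empty object whose key ends in "/").
s3.put_object(Bucket="bucket", Key="path/", Body=b"")

# Step 2: another tool copies real data under the same prefix.
s3.put_object(Bucket="bucket", Key="path/data.csv", Body=b"a,b,c\n")

# Step 3: the marker and the data object now coexist. Pre-fix s3a
# releases could keep treating "path" as an empty directory because the
# marker written in step 1 was still present.
resp = s3.list_objects_v2(Bucket="bucket", Prefix="path/")
print([o["Key"] for o in resp.get("Contents", [])])
# -> ['path/', 'path/data.csv']
```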
--
This message was sent by Atlassian Jira
(v8.20.10#820010)