[ 
https://issues.apache.org/jira/browse/HADOOP-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899589#comment-13899589
 ] 

Jason Lowe commented on HADOOP-10340:
-------------------------------------

Looking at the 1.x code, it appears that it also adds directories to the 
results, but somewhat inconsistently.  It only adds them if they are not 
immediately under the initial input path.  From the FileInputFormat.listStatus() 
code:

{code}
      FileStatus[] matches = fs.globStatus(p, inputFilter);
      if (matches == null) {
        errors.add(new IOException("Input path does not exist: " + p));
      } else if (matches.length == 0) {
        errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
      } else {
        for (FileStatus globStat: matches) {
          if (globStat.isDir()) {
            for(FileStatus stat: fs.listStatus(globStat.getPath(),
                inputFilter)) {
              result.add(stat);
            }
          } else {
            result.add(globStat);
          }
        }
      }
{code}

Note how it blindly adds every entry from the second-level directory listing 
to the results rather than applying the directory-handling logic recursively.  
That inconsistent directory handling in 1.x seems like a bug to me.  However, 
note that it does not skip any directories -- it either adds the contents of 
the directory or the directory itself.  I don't think it's OK to skip the 
directory entirely when gathering the input, or we could easily and silently 
drop input data for the job.
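
To make the difference concrete, here is a minimal, self-contained sketch.  It 
does not use the Hadoop API: Node, listOneLevel, listSkippingDirs and 
sampleTree are hypothetical names made up for illustration, and the in-memory 
tree only stands in for what fs.listStatus() would return.  It compares the 
one-level expansion from the 1.x excerpt with the skip-unless-recursive change 
proposed in the issue description.

```java
import java.util.ArrayList;
import java.util.List;

class ListStatusSketch {
    /** Hypothetical stand-in for a file or directory entry; not Hadoop's FileStatus. */
    static final class Node {
        final String path;
        final boolean isDir;
        final List<Node> children = new ArrayList<>();

        Node(String path, boolean isDir) {
            this.path = path;
            this.isDir = isDir;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Mirrors the 1.x excerpt: a matched directory is expanded exactly one level. */
    static List<String> listOneLevel(Node match) {
        List<String> result = new ArrayList<>();
        if (match.isDir) {
            for (Node child : match.children) {
                result.add(child.path); // subdirectories are added as-is
            }
        } else {
            result.add(match.path);
        }
        return result;
    }

    /** Mirrors the change proposed in the issue: skip directories unless recursive. */
    static List<String> listSkippingDirs(Node dir, boolean recursive) {
        List<String> result = new ArrayList<>();
        for (Node child : dir.children) {
            if (child.isDir) {
                if (recursive) {
                    result.addAll(listSkippingDirs(child, true));
                } // else: the directory and everything under it is dropped
            } else {
                result.add(child.path);
            }
        }
        return result;
    }

    /** Sample layout: /in/a.txt and /in/sub/b.txt. */
    static Node sampleTree() {
        Node sub = new Node("/in/sub", true).add(new Node("/in/sub/b.txt", false));
        return new Node("/in", true).add(new Node("/in/a.txt", false)).add(sub);
    }

    public static void main(String[] args) {
        Node in = sampleTree();
        System.out.println(listOneLevel(in));            // [/in/a.txt, /in/sub]
        System.out.println(listSkippingDirs(in, false)); // [/in/a.txt]
        System.out.println(listSkippingDirs(in, true));  // [/in/a.txt, /in/sub/b.txt]
    }
}
```

In this sketch the non-recursive skipping variant never surfaces /in/sub/b.txt 
at all, which is the silent loss of input I am worried about, whereas the 
one-level variant at least returns /in/sub so that a later stage can fail 
loudly on it.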

> FileInputFormat.listStatus() including directories in its results
> -----------------------------------------------------------------
>
>                 Key: HADOOP-10340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10340
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Jason Dere
>
> Trying to track down HIVE-6401, where we see some "is not a file" errors 
> because getSplits() is giving us directories.  I believe the culprit is 
> FileInputFormat.listStatus():
> {code}
>                 if (recursive && stat.isDirectory()) {
>                   addInputPathRecursively(result, fs, stat.getPath(),
>                       inputFilter);
>                 } else {
>                   result.add(stat);
>                 }
> {code}
> This seems to allow directories to be added to the results if recursive is 
> false.  Is this meant to return directories? If not, I think it should look 
> like this:
> {code}
>                 if (stat.isDirectory()) {
>                   if (recursive) {
>                     addInputPathRecursively(result, fs, stat.getPath(),
>                         inputFilter);
>                   }
>                 } else {
>                   result.add(stat);
>                 }
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
