[ 
https://issues.apache.org/jira/browse/TIKA-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iachimoe updated TIKA-4705:
---------------------------
    Description: 
Example structure:

test-nested-tarball.tar contains:

 folderContainingTgz/inner/nested.tgz

 

The resource name for nested.tgz would be 
`folderContainingTgz/inner/nested.tgz`, which is consistent with the general 
behaviour for nested archives (e.g. zips).

However, if nested.tgz does not contain metadata specifying the name of the 
nested file within, then that file will have a resourceName of 
`folderContainingTgz/inner/nested.tar`. This is inconsistent with how other 
nested archives behave, because parent folders are generally only included if 
they relate to the immediate parent archive. The parent archive of nested.tgz 
in this example is test-nested-tarball.tar, which is why it makes sense for 
the folders to be included. However, the parent archive of nested.tar is 
nested.tgz, and there is no folder called folderContainingTgz within 
nested.tgz.
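
To illustrate the naming I would expect, here is a minimal sketch. It is not Tika's actual code: the class `NestedResourceName` and the method `fallbackEntryName` are hypothetical names made up for this example, and the suffix handling covers only `.tgz` and `.gz`. It assumes the fallback name is derived from the container's resource name when the gzip stream carries no name metadata.

```java
public class NestedResourceName {

    // Hypothetical helper (not a Tika API): derive a fallback name for the
    // single entry of a gzip-style container that has no name metadata.
    public static String fallbackEntryName(String containerResourceName) {
        // Keep only the last path segment. The folders belong to the outer
        // archive, not to the container itself.
        int slash = containerResourceName.lastIndexOf('/');
        String fileName = containerResourceName.substring(slash + 1);

        if (fileName.endsWith(".tgz")) {
            // nested.tgz -> nested.tar
            return fileName.substring(0, fileName.length() - ".tgz".length()) + ".tar";
        }
        if (fileName.endsWith(".gz")) {
            // nested.tar.gz -> nested.tar
            return fileName.substring(0, fileName.length() - ".gz".length());
        }
        // Unknown suffix: return the bare file name unchanged.
        return fileName;
    }

    public static void main(String[] args) {
        // Prints the expected resourceName for the file inside nested.tgz.
        System.out.println(fallbackEntryName("folderContainingTgz/inner/nested.tgz"));
    }
}
```

With the example structure above, this yields `nested.tar` rather than `folderContainingTgz/inner/nested.tar`.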

 

A draft pull request, with a unit test that hopefully makes the issue clear and 
a proposed fix, is at https://github.com/apache/tika/pull/2730/changes

 

  was:
Example structure:

test-nested-tarball.tar contains:

 folderContainingTgz/inner/nested.tgz

 

The resource name for nested.tgz would be 
`folderContainingTgz/inner/nested.tgz`, which is consistent with the general 
behaviour for nested archives (e.g. zips).

However, if nested.tgz does not contain metadata specifying the name of the 
nested file within, then that file will have a resourceName of 
`folderContainingTgz/inner/nested.tar`. This is inconsistent with how other 
nested archives behave, because parent folders are generally only included if 
they relate to the immediate parent archive. The parent archive of nested.tgz 
in this example is test-nested-tarball.tar, which is why it makes sense for 
the folders to be included. However, the parent archive of nested.tar is 
nested.tgz, and there is no folder called folderContainingTgz within 
nested.tgz.

 

Draft pull request to follow with a unit test that will hopefully make the 
issue clear, and a proposed fix.

 


> resourceName of tar file in nested tarball should not contain tarball's 
> parent directories
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4705
>                 URL: https://issues.apache.org/jira/browse/TIKA-4705
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Iachimoe
>            Priority: Major
>
> Example structure:
> test-nested-tarball.tar contains:
>  folderContainingTgz/inner/nested.tgz
>  
> The resource name for nested.tgz would be 
> `folderContainingTgz/inner/nested.tgz`, which is consistent with the general 
> behaviour for nested archives (e.g. zips).
> However, if nested.tgz does not contain metadata specifying the name of the 
> nested file within, then that file will have a resourceName of 
> `folderContainingTgz/inner/nested.tar`. This is inconsistent with how other 
> nested archives behave, because parent folders are generally only included 
> if they relate to the immediate parent archive. The parent archive of 
> nested.tgz in this example is test-nested-tarball.tar, which is why it makes 
> sense for the folders to be included. However, the parent archive of 
> nested.tar is nested.tgz, and there is no folder called folderContainingTgz 
> within nested.tgz.
>  
> A draft pull request, with a unit test that hopefully makes the issue clear 
> and a proposed fix, is at https://github.com/apache/tika/pull/2730/changes
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
