Iachimoe created TIKA-4705:
------------------------------
Summary: resourceName of tar file in nested tarball should not
contain tarball's parent directories
Key: TIKA-4705
URL: https://issues.apache.org/jira/browse/TIKA-4705
Project: Tika
Issue Type: Improvement
Reporter: Iachimoe
Example structure:
test-nested-tarball.tar contains:
folderContainingTgz/inner/nested.tgz
The resource name for nested.tgz would be
`folderContainingTgz/inner/nested.tgz`, which is consistent with the general
behaviour for nested archives (e.g. zips).
However, if nested.tgz does not contain metadata specifying the name of the
file nested within it, then that file will have a resourceName of
`folderContainingTgz/inner/nested.tar`. This is inconsistent with how other
nested archives behave, because parent folders are generally only included
if they relate to the immediate parent archive. The parent archive of
nested.tgz in this example is test-nested-tarball.tar, which is why it makes
sense for the folders to be included. However, the parent archive of
nested.tar is nested.tgz, and there is no folder called folderContainingTgz
within nested.tgz. The expected resourceName would therefore be simply
`nested.tar`.
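To make the expected naming concrete, here is a minimal sketch of the derivation described above. It is not Tika's actual code: the class `NestedNameSketch` and the method `deriveInnerName` are hypothetical names used only for illustration, and the extension handling (`.tgz` becomes `.tar`, `.gz` is dropped) is an assumption about how a fallback name might be chosen when the gzip container records no file name.

```java
import java.util.Locale;

public class NestedNameSketch {

    // Hypothetical helper, not part of Tika: derives a fallback name for the
    // single stream inside a compressed file that records no name of its own.
    public static String deriveInnerName(String parentResourceName) {
        // Keep only the last path segment. Any leading folders describe where
        // the compressed file sits in ITS parent archive, not what it contains.
        String normalized = parentResourceName.replace('\\', '/');
        String base = normalized.substring(normalized.lastIndexOf('/') + 1);

        // Assumed extension mapping for the fallback name.
        String lower = base.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".tgz")) {
            return base.substring(0, base.length() - 4) + ".tar";
        }
        if (lower.endsWith(".gz")) {
            return base.substring(0, base.length() - 3);
        }
        return base;
    }

    public static void main(String[] args) {
        // Prints "nested.tar" rather than "folderContainingTgz/inner/nested.tar".
        System.out.println(deriveInnerName("folderContainingTgz/inner/nested.tgz"));
    }
}
```

The key design choice is that the directory prefix is discarded before the extension is adjusted, so the derived name only ever reflects the immediate parent (nested.tgz), never the archive that contains it.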
Draft pull request to follow with a unit test that will hopefully make the
issue clear, and a proposed fix.