[jira] [Comment Edited] (TIKA-3110) cannot extract metadata from 7z .tar archive

Tim Allison (Jira) Wed, 10 Jun 2020 14:30:24 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132728#comment-17132728
 ]


Tim Allison edited comment on TIKA-3110 at 6/10/20, 9:29 PM:
-------------------------------------------------------------

{noformat}
Caused by: java.io.IOException: tried to skip 7168 but actually skipped: 0
        at org.apache.tika.io.TikaInputStream.skip(TikaInputStream.java:717)
        at org.apache.commons.io.input.ProxyInputStream.skip(ProxyInputStream.java:117)
        at org.apache.commons.compress.utils.IOUtils.skip(IOUtils.java:113)
        at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.consumeRemainderOfLastBlock(TarArchiveInputStream.java:987)
        at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getRecord(TarArchiveInputStream.java:487)
        at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:360)
        at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:799)

{noformat}

This is a regression (or new feature?) going from 1.24 -> 1.24.1.

For the sake of security, I changed TikaInputStream's skip() to require that 
the given number of bytes actually be skipped.  This prevents infinite loops in 
parsers that forget to check the return value of skip(), or that trust 
FileInputStream.skip(), which no one ever, ever should.
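
To illustrate the idea, here is a minimal sketch of a "skip exactly n bytes or fail" helper. It is not the actual TikaInputStream code; the class name StrictSkip and the method skipFully are hypothetical, and only the error message wording is borrowed from the trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika or commons-compress.
final class StrictSkip {

    /**
     * Skips exactly n bytes, or throws if the stream ends first.
     * InputStream.skip() may legally skip fewer bytes than requested
     * (including zero), so its return value has to be checked in a loop.
     */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() made no progress: read one byte to tell a lazy
            // skip() apart from a real end of stream.
            if (in.read() == -1) {
                throw new EOFException("tried to skip " + n
                        + " but actually skipped: " + (n - remaining));
            }
            remaining--;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[3072]);
        skipFully(in, 3072); // succeeds: all 3072 bytes are present
        try {
            skipFully(in, 7168); // the stream is exhausted, so this throws
        } catch (EOFException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The single-byte read is the design choice that rules out an infinite loop: every pass through the loop either consumes at least one byte or throws.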

My sense was that there may be some MP4s out there that will cause problems 
(e.g. they can sometimes end mid-frame), and I'm now thinking we've run into 
this sooner, with .tar files.

[~bodewig] would you or a colleague on commons-compress know whether we should 
expect this behavior for tar files, where they allege they have more data than 
they actually do?

In short, is this something we should throw an exception for or should we 
happily let the tar file allege it has more bytes than it does?
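
One possible reading of the 7168 in the trace, offered as an assumption rather than a confirmed diagnosis: if the reader pads out to a block boundary after the last record, and the block size is the conventional tar default of 10240 bytes (20 records of 512 bytes), then an archive that stops 3072 bytes into its final block leaves exactly 7168 bytes "missing". The class and method names below are hypothetical, and the 3072 figure is inferred from the trace, not taken from the reporter's file.

```java
// Hypothetical illustration of block-boundary padding arithmetic.
final class TarBlockPadding {

    /** Bytes needed to reach the next multiple of blockSize; 0 if already aligned. */
    static long paddingToBlockBoundary(long bytesRead, int blockSize) {
        long inLastBlock = bytesRead % blockSize;
        return inLastBlock == 0 ? 0 : blockSize - inLastBlock;
    }

    public static void main(String[] args) {
        int blockSize = 10240; // assumed default: 20 records x 512 bytes
        long bytesRead = 3072; // assumed position within the final block
        System.out.println(paddingToBlockBoundary(bytesRead, blockSize));
    }
}
```

Under those assumptions, a writer that does not pad its output to a full block would produce a file that looks truncated to a reader that insists on skipping the padding.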


> cannot extract metadata from 7z .tar archive
> --------------------------------------------
>
>                 Key: TIKA-3110
>                 URL: https://issues.apache.org/jira/browse/TIKA-3110
>             Project: Tika
>          Issue Type: Bug
>          Components: mime, parser
>    Affects Versions: 1.24.1
>            Reporter: Alex
>            Priority: Major
>
> When I extract metadata from a .tar archive which was created from the Linux 
> bash shell, it works as I expect, but if the .tar archive was created by 7z 
> I get an error:
>  TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@4d0f2471 
> I created a project on GitHub for your convenience. It includes 2 files and 
> code to play around with: [https://github.com/AlexOkayJ/apache-tika-tar-issue.git]
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
