[
https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134328#comment-17134328
]
Stefan Bodewig commented on TIKA-3110:
--------------------------------------
The short answer is: yes.
The longer version: don't assume anything when dealing with archive formats
that have been around for decades. If only there were *the* tar format. :)
I will avoid the terms block and record as GNU tar and BSD tar seem to use them
differently.
Traditionally the tar format consists of chunks of 512 bytes of data and groups
10 such chunks into a larger unit, likely because it was easier to write this
bigger amount of data to a tape in one go. Back then all tar archives would
consist of 5kB (5120-byte) units and the archive would be padded with 0s to
reach a multiple of 5kB if the last entry didn't fill the unit entirely.
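To make the padding arithmetic concrete, here is a minimal sketch. The class
and method names are made up for illustration and are not part of Commons
Compress or Tika; it only assumes the traditional layout of 10 chunks of 512
bytes per unit described above.

```java
// Illustrative only: TarPadding and paddedSize are hypothetical names.
final class TarPadding {
    static final int CHUNK_SIZE = 512;                          // one tar chunk
    static final int CHUNKS_PER_UNIT = 10;                      // traditional grouping
    static final int UNIT_SIZE = CHUNK_SIZE * CHUNKS_PER_UNIT;  // 5120 bytes

    /** Rounds the archive size up to the next multiple of the 5120-byte unit. */
    static long paddedSize(long contentBytes) {
        long remainder = contentBytes % UNIT_SIZE;
        return remainder == 0 ? contentBytes : contentBytes + (UNIT_SIZE - remainder);
    }
}
```

An archive whose entries end after 6144 bytes would, under this assumption,
be padded with zeros up to 10240 bytes.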
A lot of dialects have spawned since. Some tar tools will not fill the last
unit. To make things worse, tar archives are supposed to signal EOF with two
512-byte chunks of zeros. Some archivers create both markers, others only add
one chunk, and others don't add any at all.
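If you want to check which flavour a given archive is, a rough sketch like the
following counts the all-zero 512-byte chunks at the end of the file. The names
are hypothetical, the whole archive is assumed to be in memory, and any
trailing partial chunk is simply ignored. Note that the count includes zero
padding as well as the EOF markers, since the two cannot be told apart.

```java
// Illustrative only: TarEofMarker and trailingZeroChunks are hypothetical names.
final class TarEofMarker {
    static final int CHUNK_SIZE = 512;

    /** Counts the all-zero 512-byte chunks at the end of the archive bytes. */
    static int trailingZeroChunks(byte[] archive) {
        int fullChunks = archive.length / CHUNK_SIZE; // a partial tail is ignored
        int count = 0;
        for (int chunk = fullChunks - 1; chunk >= 0; chunk--) {
            if (!isAllZero(archive, chunk * CHUNK_SIZE)) {
                break; // first chunk (from the end) that carries data
            }
            count++;
        }
        return count;
    }

    private static boolean isAllZero(byte[] data, int offset) {
        for (int i = offset; i < offset + CHUNK_SIZE; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

A result of 0 or 1 would point at an archiver that skipped one or both EOF
chunks; 2 or more means the markers (and possibly padding) are present.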
consumeRemainderOfLastBlock tries to consume the whole 5kB unit it is looking
at and if the stream ends prematurely, well, then it has probably been created
by an archiver that didn't care and we won't complain.
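The idea can be sketched roughly as follows. This is not the actual Commons
Compress code, just an illustration of the lenient behaviour under the
assumption of 5120-byte units; the class and method names and the
bytesReadSoFar parameter are invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: not the real consumeRemainderOfLastBlock implementation.
final class LenientTarTail {
    static final int UNIT_SIZE = 10 * 512; // 5120 bytes

    /**
     * Reads up to the end of the current 5120-byte unit and returns how many
     * bytes were actually consumed. Hitting EOF early is not an error.
     */
    static long consumeRestOfUnit(InputStream in, long bytesReadSoFar) throws IOException {
        long remaining = (UNIT_SIZE - bytesReadSoFar % UNIT_SIZE) % UNIT_SIZE;
        long consumed = 0;
        byte[] buffer = new byte[512];
        while (consumed < remaining) {
            int want = (int) Math.min(buffer.length, remaining - consumed);
            int n = in.read(buffer, 0, want);
            if (n < 0) {
                break; // stream ended prematurely: tolerated, no exception
            }
            consumed += n;
        }
        return consumed;
    }
}
```

With 1024 bytes already read, the sketch tries to consume another 4096 bytes;
if the stream only has 100 bytes left it returns 100 and moves on quietly.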
> cannot extract metadata from 7z .tar archive
> --------------------------------------------
>
> Key: TIKA-3110
> URL: https://issues.apache.org/jira/browse/TIKA-3110
> Project: Tika
> Issue Type: Bug
> Components: mime, parser
> Affects Versions: 1.24.1
> Reporter: Alex
> Priority: Major
> Attachments: 7ztar.tar
>
>
> When I extract metadata from a .tar archive which was created by linux bash
> it works as I expect, but if the .tar archive was created by 7z I get an error:
> TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pkg.PackageParser@4d0f2471
> I created a project on GitHub for your convenience. It includes 2 files and
> code to play around with: [https://github.com/AlexOkayJ/apache-tika-tar-issue.git]
>