[
https://issues.apache.org/jira/browse/TIKA-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979298#comment-16979298
]
Tim Allison commented on TIKA-2995:
-----------------------------------
I'm happy to bump the markLimit. What do others think?
You _should_ be able to configure it via a tika_config.xml along these lines:
{noformat}
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.OverrideDetector"/>
    <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector">
      <params>
        <param name="markLimit" type="int">134217728</param>
      </params>
    </detector>
    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    <detector class="org.gagravarr.tika.OggDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>
{noformat}
> markLimit too small in
> org.apache.tika.parser.microsoft.POIFSContainerDetector
> --------------------------------------------------------------------------------
>
> Key: TIKA-2995
> URL: https://issues.apache.org/jira/browse/TIKA-2995
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.22
> Reporter: Tim Barrett
> Priority: Major
>
> Tika fails to parse large msg files (msg files > 16MB in size). This is
> because the property markLimit in POIFSContainerDetector is set to 16MB.
> Although there is a public set method in the class, this is not called within
> Tika as we use the DefaultDetector, which encapsulates the use of
> POIFSContainerDetector.
> As a workaround we have made the following change in POIFSContainerDetector:
>
> @Field
> // private int markLimit = 16 * 1024 * 1024;
> private int markLimit = 128 * 1024 * 1024;
> Could a better fix be to have the DefaultDetector call setMarkLimit with a
> higher value? msg files with attachments are often greater than 16MB in size.
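
For readers unfamiliar with mark limits in general, here is a minimal sketch of the standard java.io mark/reset contract: a stream only promises to rewind if no more than the marked number of bytes has been read since mark() was called. This is an analogy using plain JDK classes, not Tika code. The class name MarkLimitDemo and the method canRewind are hypothetical, and whether POIFSContainerDetector applies its markLimit in exactly this way is not stated in this thread.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: illustrates the generic java.io mark/reset
// contract, not the internals of any Tika detector.
class MarkLimitDemo {

    /**
     * Marks the start of a 64-byte in-memory stream, reads up to
     * bytesToRead bytes one at a time, then tries to rewind to the mark.
     * Returns true if reset() succeeded, false if the mark was invalidated.
     */
    static boolean canRewind(int markLimit, int bytesToRead) {
        byte[] data = new byte[64];
        // A deliberately tiny 8-byte internal buffer so that the mark limit
        // decides whether the stream keeps the data needed to rewind.
        // (In-memory streams hold no OS resources, so close() is skipped.)
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 8);
        try {
            in.mark(markLimit);
            for (int i = 0; i < bytesToRead; i++) {
                if (in.read() < 0) {
                    break; // end of stream
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            // BufferedInputStream reports an invalidated mark this way.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canRewind(4, 16));  // false: read well past the limit
        System.out.println(canRewind(64, 16)); // true: still within the limit
    }
}
```

Reading past the limit is not guaranteed to fail on every stream; the contract only guarantees that rewinding works while you stay within it. That is why a limit smaller than the input can surface as a failure only for large files.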
--
This message was sent by Atlassian Jira
(v8.3.4#803005)