[
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132700#comment-17132700
]
Tim Allison commented on TIKA-3097:
-----------------------------------
Not a dumb question at all.
So, in general, Tika tries to do SAX-like parsing and to emit SAX events
where it can. This was one of the early core design principles. However,
some of our dependencies load the full file into memory, and only then do we
emit the SAX events. For some file types, there are probably more efficient
ways of doing this...for example, I _think_ there's an on-demand parser in the
works for PDFBox 3.0 (?) that should use a far smaller memory footprint for
text extraction.
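To make the streaming idea concrete, here is a minimal sketch using only the
JDK's built-in SAX parser, not Tika's own API. The class and handler names are
made up for illustration. The handler sees each element and text chunk as an
event and keeps only running counters, so memory use does not grow with the
size of the input the way it would if the whole tree were built first.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of Tika.
public class SaxStreamingSketch {

    /** Receives SAX events and keeps only counters, never the document itself. */
    public static final class TextLengthHandler extends DefaultHandler {
        public long textChars = 0;
        public int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++; // one event per opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks; summing the lengths is chunk-safe.
            textChars += length;
        }
    }

    /** Streams the input through the JDK SAX parser and returns the filled handler. */
    public static TextLengthHandler scan(InputStream in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        TextLengthHandler handler = new TextLengthHandler();
        factory.newSAXParser().parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>Hello</p><p>Tika</p></doc>";
        TextLengthHandler h =
                scan(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(h.elements + " elements, " + h.textChars + " text chars");
        // prints "3 elements, 9 text chars"
    }
}
```

A parser that has to load the full file before it can emit anything loses this
property, which is where the larger heap requirements tend to come from.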
We do what we can. That said, there are times when a parser uses an unseemly
amount of memory, and that's an actual bug that can be easily fixed. So, if you
find particular files causing problems, let us know.
> Out of memory while parsing docx
> --------------------------------
>
> Key: TIKA-3097
> URL: https://issues.apache.org/jira/browse/TIKA-3097
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.24
> Reporter: suchendra
> Priority: Major
> Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt,
> test.docx
>
>
> I have written simple Scala code to extract the content from an uploaded
> docx file. The JVM goes OOM when Tika tries to parse the file. I
> configured the JVM heap to 1GB and then tried 2GB, and the same issue occurs,
> both with the jar and in my code.
> I have attached the file for reference.