[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

Tim Allison (Jira) Wed, 10 Jun 2020 13:35:23 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132700#comment-17132700
 ]


Tim Allison commented on TIKA-3097:
-----------------------------------

Not a dumb question at all.

So, in general, Tika tries to do SAX-like parsing and tries to emit SAX events 
where it can.  This was one of the early core design principles.  However, 
some dependencies load the full file into memory, and only then do we 
emit the SAX events.  For some file types, there are probably more efficient 
ways of doing this.  For example, I _think_ there's an on-demand parser in the 
works for PDFBox 3.0 (?) that should use a far smaller memory footprint for 
text extraction.  
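
To make "emit SAX events" concrete, here is a minimal sketch using only the 
JDK's SAX parser, not Tika itself.  The class and method names 
(StreamingTextCounter, CountingHandler, countCharacters) are hypothetical and 
chosen for illustration.  The handler sees text in chunks as it is parsed and 
keeps only a running total, so the consumer never holds the full text.  
Tika's Parser.parse accepts an org.xml.sax.ContentHandler, and DefaultHandler 
implements that interface, so a handler of this shape could be passed to a 
Tika parser as well.  Note that this only keeps the consumer side small: if an 
underlying dependency loads the whole file first, that memory is still used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingTextCounter {

    /** Hypothetical handler: counts text characters as events arrive and stores none of them. */
    static final class CountingHandler extends DefaultHandler {
        long characterCount = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            // Each call delivers one chunk of text; only the running total is kept.
            characterCount += length;
        }
    }

    /** Parses the XML stream with the JDK SAX parser and returns the total text length. */
    static long countCharacters(InputStream xml) throws Exception {
        CountingHandler handler = new CountingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return handler.characterCount;
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory document standing in for a real input stream.
        byte[] doc = "<doc><p>Hello</p><p>Tika</p></doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(countCharacters(new ByteArrayInputStream(doc)));
    }
}
```

The design point is that the handler's memory use does not grow with the size 
of the document, which is the property SAX-style parsing is meant to give.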

We do what we can.  That said, there are times when a parser uses an unseemly 
amount of memory, and that's an actual bug that can be easily fixed.  So, if you 
find particular files causing problems, let us know.



> Out of memory while parsing docx
> --------------------------------
>
>                 Key: TIKA-3097
>                 URL: https://issues.apache.org/jira/browse/TIKA-3097
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.24
>            Reporter: suchendra
>            Priority: Major
>         Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt, 
> test.docx
>
>
> I have written simple Scala code to extract the content from an uploaded 
> docx file. The JVM goes OOM when Tika tries to parse the file. I configured 
> the JVM heap to 1GB and then tried 2GB, and the same issue occurs, both 
> with the jar and in my code.
> The file is attached for reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
