Re: Parsing huge PDF (400Mb, 2700 pages)

Tim Allison Thu, 14 Nov 2019 06:07:47 -0800

CC'ing colleagues on PDFBox...any recommendations?

Sergey's recommendation is great for documents that can be parsed via
streaming.  However, PDFBox does not currently parse PDFs in a streaming
mode.  It builds the full document tree -- PDFBox colleagues let me know if
I'm wrong.
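
For anyone who wants to try Sergey's multipart suggestion, here is a rough client-side sketch in Python (standard library only). It reads the PDF from disk in chunks and posts it as multipart/form-data, so the client never holds the whole file in memory. The host, port 9998, the /tika/form endpoint, the "upload" field name, and all function names are assumptions for illustration; check them against your tika-server version. This only helps with the upload side; it does not change how PDFBox parses the file once it arrives.

```python
import http.client
import os
import uuid


def multipart_frame(path, boundary, field="upload"):
    """Return the bytes that go before and after the file content."""
    filename = os.path.basename(path)
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head, tail


def iter_body(path, head, tail, chunk_size=1 << 20):
    """Yield the multipart body piece by piece, reading the file in chunks."""
    yield head
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            yield chunk
    yield tail


def post_pdf(path, host="localhost", port=9998, endpoint="/tika/form"):
    """POST one PDF to a (assumed) tika-server multipart endpoint."""
    boundary = uuid.uuid4().hex
    head, tail = multipart_frame(path, boundary)
    # Content-Length is computed up front so the body can be sent as an iterable.
    length = len(head) + os.path.getsize(path) + len(tail)
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request(
            "POST",
            endpoint,
            body=iter_body(path, head, tail),
            headers={
                "Content-Type": f"multipart/form-data; boundary={boundary}",
                "Content-Length": str(length),
                "Accept": "text/plain",
            },
        )
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

Calling post_pdf("big.pdf") against a running server would return the HTTP status and the extracted text as bytes; "big.pdf" is a placeholder path.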


On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin <[email protected]>
wrote:

> Hi,
> Are you using tika-server? If so, and you can submit the data as a
> multipart/form-data payload, that may help: CXF (used by tika-server)
> should make a best effort to save the multipart payloads to temporary
> locations on disk, and thus minimize the memory requirements.
>
> Cheers, Sergey
>
>
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext) <
> [email protected]> wrote:
>
>> Hi,
>>
>> My application handles all kinds of documents (mainly PDFs). In a very few
>> cases, you might expect huge PDFs (< 500MB).
>>
>> At around 400MB I am hitting the wall: parsing takes ages (although it is
>> quite fast at the beginning). I've tried several ideas, but none of them
>> brought the desired improvement.
>>
>> I have the impression that memory plays a role. I have no more than 3GB
>> (and I think this should be enough, as we are streaming the document and
>> using an event-based XML parser).
>>
>> Are there things I should be aware of?
>>
>> Any hint would be very welcome. Thanks and have a nice day,
>>
>> christian
>>
>>
