CC'ing colleagues on PDFBox: any recommendations?

Sergey's recommendation is great for documents that can be parsed via streaming. However, PDFBox does not currently parse PDFs in a streaming mode; it builds the full document tree. PDFBox colleagues, please let me know if I'm wrong.
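
To make the distinction concrete, here is a minimal sketch of the kind of spooling Sergey describes: the payload is copied to a temp file through a small fixed buffer, so heap use on the transport side stays flat however large the upload is. This is plain JDK code, not CXF's actual implementation, and the names `SpoolToDisk` and `spool` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of CXF, Tika, or PDFBox.
public class SpoolToDisk {

    // Copies the stream to a temp file. Files.copy reads through an internal
    // fixed-size buffer, so the whole payload is never held on the heap.
    static Path spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("upload-", ".pdf");
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a large upload; a real payload would arrive over the network.
        InputStream payload = new ByteArrayInputStream(new byte[1_000_000]);
        Path tmp = spool(payload);
        System.out.println("Spooled " + tmp.toFile().length() + " bytes to " + tmp);
        tmp.toFile().delete();
    }
}
```

Spooling like this only keeps the upload itself out of memory. Once the file is handed to the parser, the parser's own memory use still applies, which is the concern with PDFBox above.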

On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin <[email protected]> wrote:

> Hi,
> Are you using tika-server ? If yes and you can submit the data using a
> multipart/form-data payload then it may help, CXF (used by tika-server)
> should do the best effort at saving the multipart payloads to the temp
> locations on the disk, and thus minimize the memory requirements
>
> Cheers, Sergey
>
>
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext) <
> [email protected]> wrote:
>
>> Hi,
>>
>> My application handles all kind of documents (mainly PDFs). In a very few
>> cases, you might expect huge PDFs (< 500MB).
>>
>> By around 400MB I am hitting the wall, parsing takes ages (although quite
>> fast at the beginning). I've tried several ideas but none of them brought
>> the desired amelioration.
>>
>> I have the impression that memory plays a role. I have no more than 3GB
>> (and I think this should be enough as we are streaming the document and
>> using event based XML parser).
>>
>> Are they things I should be aware of?
>>
>> Any hint would be very welcome. Thanks and have a nice day,
>>
>> christian

