Re: Possible memory leak when extracting text?

Tilman Hausherr Sat, 11 May 2019 02:05:23 -0700

The reason I mentioned 2.0.16 is because of this bug:
https://issues.apache.org/jira/browse/PDFBOX-4489

That one happened with a corrupt file. Yours isn't corrupt, but it could become so if it gets altered in transfer or in filtering.


2.0.16 snapshot is here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/

Tilman

On 11.05.2019 at 06:54, Søren Pedersen wrote:

Ok, that is very interesting. Thanks a lot for looking into this!

I am a bit baffled as to why we experience the memory leak then, but I guess I 
will have to dig more into it.

Best regards,
Søren

On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <[email protected]>, wrote:

On 10.05.19 at 15:52, Søren Pedersen wrote:

I have done some more testing, and I found that when I run on Windows there are 
no problems, but when I run on Linux I get the memory leak. Tilman, would you 
be able to run the same test on a Linux box? - or maybe using a Linux Docker 
container, like I showed originally?

I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without any
problems using

java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText

where -Xmx9m is the smallest heap setting that still works.

Andreas

We would prefer to run our app on Linux, but this looks like a blocker for that 
unfortunately :(

Best regards,
Søren Pedersen

On 10 May 2019, 09.32 +0200, Søren Pedersen <[email protected]>, wrote:

Ok, thanks a lot for looking into this Tilman. I will try your suggestion and 
keep fiddling with it :)

Have a great weekend!

On 10 May 2019, 08.12 +0200, Tilman Hausherr <[email protected]>, wrote:

On 10.05.2019 at 07:22, Søren Pedersen wrote:

We have an application that can index the contents of PDF files, so that we
can use that for a search algorithm. We use the Apache PDFBox library for
extracting text from a PDF, like this (where inputStream is a
ByteArrayInputStream containing the contents of the PDF file):

PDFTextStripper pdfStripper = new PDFTextStripper();
pdDoc = PDDocument.load(inputStream,
        MemoryUsageSetting.setupTempFileOnly());
String parsedText = pdfStripper.getText(pdDoc);


You can pass the byte[] directly to load(). Also make sure that the
bytes are not altered in any way, e.g. through an incorrectly configured
web download or incorrectly configured resource loading (the
"filtering" option must be false).
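
One way to check that the bytes really arrive unaltered is to compare a
SHA-256 digest of the byte[] inside the application with a digest of the
original file taken on the machine it came from (for example with
sha256sum). The following is only a minimal sketch using the standard
Java library. The class name PdfBytesCheck and its methods are
hypothetical and not part of PDFBox, and the charset round trip in main()
is just one assumed example of how binary data can get damaged, not a
diagnosis of the problem in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
final class PdfBytesCheck {

    // Returns the SHA-256 digest of the given bytes as lowercase hex,
    // the same format that the sha256sum command prints.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A tiny stand-in for the start of a PDF: the header line followed
        // by a comment line with bytes above 127, as many PDFs have.
        byte[] original = {
            '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
            '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'
        };

        // Assumed failure mode for illustration: treating binary data as
        // UTF-8 text somewhere along the way replaces the invalid byte
        // sequences, so the bytes that come out differ from what went in.
        byte[] mangled = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("original: " + sha256Hex(original));
        System.out.println("mangled:  " + sha256Hex(mangled));
        System.out.println("identical: "
                + sha256Hex(original).equals(sha256Hex(mangled)));
    }
}
```

In the demo the "%PDF-" header survives the round trip, so a simple
header check would not notice the damage, whereas the digests differ. If
the digest computed in the application matches the one of the source
file, altered input can be ruled out as a cause.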


Also retry with 2.0.16 snapshot.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
