I tried this:

java -Xmx5m -jar pdfbox-app-2.0.15.jar ExtractText Projectplan.pdf

and it worked.

I also tried your code in a small Maven project with 2.0.15 and it worked too, although it needed -Xmx10m. I tested using Oracle JDK 1.8.0_202 on Windows 10.
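
When comparing runs like these across machines, it can help to confirm which heap limit the JVM actually picked up. A minimal sketch, assuming nothing beyond the standard library; the class name HeapLimit is hypothetical and not part of PDFBox:

```java
// Hypothetical helper (not part of PDFBox): reports the heap limit in effect.
class HeapLimit {

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        // maxMemory() reflects -Xmx when it is set; it returns Long.MAX_VALUE
        // if the JVM reports no inherent limit.
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Effective max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same -Xmx value as the failing application (for example inside the same container) shows whether the setting took effect. The reported value is usually close to the -Xmx value but not necessarily identical.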

Tilman

On 10.05.2019 at 07:22, Søren Pedersen wrote:
I have uploaded the PDF here: https://we.tl/t-lQusIcUiRM

I have tested with both version 2.0.13 and 2.0.15 of PDFBox, and I have run the
test on a machine with 16 GB of RAM, where I allowed the JVM to use 14 GB using the
-Xmx14g parameter.

I took a heap dump using JVisualVM when it used approx. 12 GB of memory, and I can see
that 98.3% of the size is taken up by int[] arrays. When I dig into those, they come from
featuredIndices in GlyphSubstitutionTable$LangSysTable -> langSysTable in
GlyphSubstitutionTable$LangSysRecord -> GlyphSubstitutionTable$LangSysRecord[].

I should also note that I run our app in a Docker container, like this:

docker run -d \
-p 8080:8080 \
-v /home/ec2-user/locate/build:/usr/build \
--name=locate \
openjdk:8 \
java -Xmx14g -Dserver.port=8080 -Dspring.profiles.active=prod \
-Djdk.tls.useExtendedMasterSecret=false -jar /usr/build/project-web-1.3.0.war

Thanks a lot in advance!

Best regards,
Søren

On 9 May 2019, 17.59 +0200, Tilman Hausherr <[email protected]>, wrote:
Please upload the file to a file-sharing service and also mention which version
you are using; it should be 2.0.15.

Tilman





--- Original message ---
From: Søren Pedersen
Subject: Possible memory leak when extracting text?
Date: 09.05.2019, 17:07
To: [email protected]


Hi there

We have an application that can index the contents of PDF files, so that we
can use that for a search algorithm. We use the Apache PDFBox library for
extracting text from a PDF, like this (where inputStream is a
ByteArrayInputStream containing the contents of the PDF file):

PDFTextStripper pdfStripper = new PDFTextStripper();
pdDoc = PDDocument.load(inputStream,
MemoryUsageSetting.setupTempFileOnly());
String parsedText = pdfStripper.getText(pdDoc);

We ran into a sample PDF file that seems to cause a memory leak, as we get
an OutOfMemoryError: Java heap space. I have attached the file to this
email (not sure if that works on a mailing list?)

Can someone try to extract the text in this PDF file, to confirm if there
is a memory leak, and maybe bring this to the attention of the developers?

Thanks a lot in advance!

Best regards,
Søren


