I tried this:
java -Xmx5m -jar pdfbox-app-2.0.15.jar ExtractText Projectplan.pdf
and it worked.
I also tried your code in a small Maven project with 2.0.15 and it
worked too, although it needed -Xmx10m. I tested using Oracle JDK 1.8.0_202 on Windows 10.
Tilman
On 10.05.2019 at 07:22, Søren Pedersen wrote:
I have uploaded the PDF here: https://we.tl/t-lQusIcUiRM
I have tested with both version 2.0.13 and 2.0.15 of PDFBox, and I ran the
test on a machine with 16 GB of RAM, where I allowed the JVM to use 14 GB via the
-Xmx14g parameter.
I took a heap dump using JVisualVM when it used approx. 12 GB of memory and I can see
that 98.3% of the size is taken up by int[] arrays. When I dig into those, they come from
featuredIndices in GlyphSubstitutionTable$LangSysTable -> langSysTable in
GlyphSubstitutionTable$LangSysRecord -> GlyphSubstitutionTable$LangSysRecord[].
I should also note that I run our app in a Docker container, like this:
docker run -d \
-p 8080:8080 \
-v /home/ec2-user/locate/build:/usr/build \
--name=locate \
openjdk:8 \
java -Xmx14g -Dserver.port=8080 -Dspring.profiles.active=prod \
-Djdk.tls.useExtendedMasterSecret=false -jar /usr/build/project-web-1.3.0.war
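Because the two test runs in this thread differ mainly in their heap limit (a few MB versus 14 GB), it can help to confirm which limit the JVM inside the container actually applied. The following is a minimal sketch using only Runtime.maxMemory() from the JDK; the class name MaxHeapCheck and the toMiB helper are hypothetical and not part of the application described above.

```java
// Hypothetical helper: prints the maximum heap the running JVM will attempt to use.
class MaxHeapCheck {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reflects the effective heap limit (for example from -Xmx);
        // it returns Long.MAX_VALUE if there is no inherent limit.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap: " + toMiB(maxBytes) + " MiB");
    }
}
```

Running it with the same JVM flags as the application, in the same image, shows the limit that is in effect there. The reported value can be somewhat lower than the -Xmx setting, depending on the garbage collector.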
Thanks a lot in advance!
Best regards,
Søren
On 9 May 2019, 17.59 +0200, Tilman Hausherr <[email protected]>, wrote:
Please upload the file to a file-sharing service and also mention which version
you are using; it should be 2.0.15.
Tilman
--- Original message ---
From: Søren Pedersen
Subject: Possible memory leak when extracting text?
Date: 09.05.2019, 17:07
To: [email protected]
Hi there
We have an application that can index the contents of PDF files, so that we
can use that for a search algorithm. We use the Apache PDFBox library for
extracting text from a PDF, like this (where inputStream is a
ByteArrayInputStream containing the contents of the PDF file):
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = PDDocument.load(inputStream,
        MemoryUsageSetting.setupTempFileOnly());
String parsedText = pdfStripper.getText(pdDoc);
We ran into a sample PDF file that seems to cause a memory leak, as we get
an OutOfMemoryError: Java heap space. I have attached the file to this
email (not sure if that works on a mailing list?)
Can someone try to extract the text in this PDF file, to confirm if there
is a memory leak, and maybe bring this to the attention of the developers?
Thanks a lot in advance!
Best regards,
Søren