Re: AW: Possible memory leak when extracting text?

Søren Pedersen Thu, 09 May 2019 22:29:12 -0700

I have uploaded the PDF here: https://we.tl/t-lQusIcUiRM


I have tested with both versions 2.0.13 and 2.0.15 of PDFBox, and I ran the 
test on a machine with 16 GB of RAM, where I allowed the JVM to use 14 GB via 
the -Xmx14g parameter.

I took a heap dump using JVisualVM when the process was using approx. 12 GB of 
memory, and I can see that 98.3% of the heap is taken up by int[] arrays. When 
I dig into those, they come from featuredIndices in 
GlyphSubstitutionTable$LangSysTable -> langSysTable in 
GlyphSubstitutionTable$LangSysRecord -> GlyphSubstitutionTable$LangSysRecord[].

I should also note that I run our app in a Docker container, like this:

docker run -d \
-p 8080:8080 \
-v /home/ec2-user/locate/build:/usr/build \
--name=locate \
openjdk:8 \
java -Xmx14g -Dserver.port=8080 -Dspring.profiles.active=prod \
-Djdk.tls.useExtendedMasterSecret=false -jar /usr/build/project-web-1.3.0.war
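
A minimal sketch for confirming that the -Xmx14g setting actually takes effect inside the container, and for sampling heap use around the extraction call. The class HeapCheck and its method names are hypothetical and are not part of PDFBox or of the application above; only standard JDK calls are used.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration only; not part of PDFBox.
class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use, in MiB.
    // With -Xmx14g this should report roughly 14 * 1024.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently in use, in MiB, as reported by the memory MXBean.
    static long usedHeapMiB() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Logging usedHeapMiB() before and after the text extraction step would give a rough picture of how much heap that step retains, although a heap dump remains the more reliable source.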

Thanks a lot in advance!

Best regards,
Søren

On 9 May 2019, 17.59 +0200, Tilman Hausherr <[email protected]>, wrote:
> please upload to a sharehoster and also mention what version you are using,
> should be 2.0.15.
>
> Tilman
>
> --- Original message ---
> From: Søren Pedersen
> Subject: Possible memory leak when extracting text?
> Date: 9 May 2019, 17:07
> To: [email protected]
>
>
> Hi there
>
> We have an application that can index the contents of PDF files, so that we
> can use that for a search algorithm. We use the Apache PDFBox library for
> extracting text from a PDF, like this (where inputStream is a
> ByteArrayInputStream containing the contents of the PDF file):
>
> PDFTextStripper pdfStripper = new PDFTextStripper();
> PDDocument pdDoc = PDDocument.load(inputStream,
>         MemoryUsageSetting.setupTempFileOnly());
> String parsedText = pdfStripper.getText(pdDoc);
>
> We ran into a sample PDF file that seems to cause a memory leak, as we get
> an OutOfMemoryError: Java heap space. I have attached the file to this
> email (not sure if that works on a mailing list?)
>
> Can someone try to extract the text in this PDF file, to confirm if there
> is a memory leak, and maybe bring this to the attention of the developers?
>
> Thanks a lot in advance!
>
> Best regards,
> Søren
