I have uploaded the PDF here: https://we.tl/t-lQusIcUiRM

I have tested with both versions 2.0.13 and 2.0.15 of PDFBox. The test ran on a machine with 16 GB of RAM, where I allowed the JVM to use 14 GB via the -Xmx14g parameter.

I took a heap dump with JVisualVM when the process was using approx. 12 GB of memory, and I can see that 98.3% of the heap is taken up by int[] arrays. When I dig into those, they are referenced through this chain:

    featuredIndices in GlyphSubstitutionTable$LangSysTable
      -> langSysTable in GlyphSubstitutionTable$LangSysRecord
      -> GlyphSubstitutionTable$LangSysRecord[]

I should also note that I run our app in a Docker container, like this:

    docker run -d \
      -p 8080:8080 \
      -v /home/ec2-user/locate/build:/usr/build \
      --name=locate \
      openjdk:8 \
      java -Xmx14g -Dserver.port=8080 -Dspring.profiles.active=prod -Djdk.tls.useExtendedMasterSecret=false -jar /usr/build/project-web-1.3.0.war

Thanks a lot in advance!

Best regards,
Søren

On 9 May 2019, 17.59 +0200, Tilman Hausherr <[email protected]>, wrote:
> please upload to a sharehoster and also mention what version you are using,
> should be 2.0.15.
>
> Tilman
>
> --- Original message ---
> From: Søren Pedersen
> Subject: Possible memory leak when extracting text?
> Date: 09.05.2019, 17:07
> To: [email protected]
>
> Hi there
>
> We have an application that can index the contents of PDF files, so that we
> can use that for a search algorithm. We use the Apache PDFBox library for
> extracting text from a PDF, like this (where inputStream is a
> ByteArrayInputStream containing the contents of the PDF file):
>
>     PDFTextStripper pdfStripper = new PDFTextStripper();
>     pdDoc = PDDocument.load(inputStream,
>         MemoryUsageSetting.setupTempFileOnly());
>     String parsedText = pdfStripper.getText(pdDoc);
>
> We ran into a sample PDF file, that seems to cause a memory leak, as we get
> an OutOfMemoryError: Java heap space. I have attached the file to this
> email (not sure if that works on a mailing list?)
>
> Can someone try to extract the text in this PDF file, to confirm if there
> is a memory leak, and maybe bring this to the attention of the developers?
>
> Thanks a lot in advance!
>
> Best regards,
> Søren
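
For anyone who wants to check heap growth around the extraction call without taking a full heap dump, the following is a minimal sketch using only the JDK's management API. The class name HeapProbe and its two methods are made up for illustration and are not part of PDFBox; the workload in main is a stand-in that allocates int[] arrays, and in the real application the lambda would wrap the pdfStripper.getText(pdDoc) call instead. The reported delta is only a rough indicator, since a garbage collection during the task can shrink it or make it negative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: reports heap usage around a task.
class HeapProbe {

    /** Returns the number of heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Runs the task and returns the change in used heap (can be negative if a GC ran). */
    static long heapDeltaBytes(Runnable task) {
        long before = usedHeapBytes();
        task.run();
        return usedHeapBytes() - before;
    }

    public static void main(String[] args) {
        // Stand-in workload: keep 16 int[] arrays of 1 MiB each reachable.
        // In the real application this would be the text extraction call.
        List<int[]> retained = new ArrayList<>();
        long delta = heapDeltaBytes(() -> {
            for (int i = 0; i < 16; i++) {
                retained.add(new int[256 * 1024]); // 256 Ki ints * 4 bytes = 1 MiB
            }
        });
        System.out.printf("retained %d arrays, heap delta: %d MiB%n",
                retained.size(), delta / (1024 * 1024));
    }
}
```

Logging this value per document makes it easier to tell whether a single PDF is responsible for the growth or whether memory accumulates across documents.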

