This is some problem with the number of components being different. Try pdfbox-app instead.

Tilman
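A minimal sketch of what "try pdfbox-app instead" might look like in the pom, assuming the snapshot repository block from the quoted message below stays unchanged and only the artifactId is swapped. Whether the snapshot actually resolves depends on the state of the snapshot repository, so this is something to try rather than a confirmed fix.

```xml
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <!-- assumption: pdfbox-app used in place of pdfbox, per the suggestion above -->
  <artifactId>pdfbox-app</artifactId>
  <version>2.0.16-SNAPSHOT</version>
</dependency>
```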
--- Original message ---
From: Søren Pedersen
Subject: Re: Possible memory leak when extracting text?
Date: 15.05.2019, 9:08
To: [email protected]

I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my application, but I keep having issues.

I added this to my pom file:

    <repositories>
      <repository>
        <id>repository.apache.org.snapshots</id>
        <name>Apache snapshots repo</name>
        <url>https://repository.apache.org/content/groups/snapshots/</url>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
        <releases>
          <enabled>false</enabled>
        </releases>
      </repository>
    </repositories>

And then I added this under dependencies:

    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>2.0.16-SNAPSHOT</version>
    </dependency>

When I run “mvn compile” I get this error:

[ERROR] Failed to execute goal on project pdftextextractor: Could not resolve dependencies for project nu.optimise:pdftextextractor:jar:1.0: Failed to collect dependencies at org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to read artifact descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Could not find artifact org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in repository.apache.org.snapshots (https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

I am probably missing something obvious, but I haven’t been working with Java for that long, so I have no clue what to do (my googling skills did not prevail). Do you have any tips?

Thanks a lot in advance!

Best regards,
Søren
On 11 May 2019, 11.04 +0200, Tilman Hausherr <[email protected]>, wrote:
> The reason I mentioned 2.0.16 is because of this bug:
> https://issues.apache.org/jira/browse/PDFBOX-4489
>
> that one happened with a corrupt file. Yours isn't, but it might be if
> it gets corrupted in transfer or in filtering.
>
> 2.0.16 snapshot is here:
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/
>
> Tilman
>
> On 11.05.2019 at 06:54, Søren Pedersen wrote:
> > Ok, that is very interesting. Thanks a lot for looking into this!
> >
> > I am a bit baffled as to why we experience the memory leak then, but I guess I will have to dig more into it.
> >
> > Best regards,
> > Søren
> > On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <[email protected]>, wrote:
> > > On 10.05.19 at 15:52, Søren Pedersen wrote:
> > > > I have done some more testing, and I found that when I run on Windows there are no problems, but when I run on Linux I get the memory leak. Tilman, would you be able to run the same test on a Linux box? Or maybe using a Linux Docker container, like I showed originally?
> > > I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without any
> > > problems using
> > >
> > > java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText
> > >
> > > where -Xmx9m is the smallest working value
> > >
> > > Andreas
> > >
> > > > We would prefer to run our app on Linux, but this looks like a blocker for that unfortunately :(
> > > >
> > > > Best regards,
> > > > Søren Pedersen
> > > > On 10 May 2019, 09.32 +0200, Søren Pedersen <[email protected]>, wrote:
> > > > > Ok, thanks a lot for looking into this Tilman. I will try your suggestion and keep fiddling with it :)
> > > > >
> > > > > Have a great weekend!
> > > > > On 10 May 2019, 08.12 +0200, Tilman Hausherr <[email protected]>, wrote:
> > > > > > On 10.05.2019 at 07:22, Søren Pedersen wrote:
> > > > > > > We have an application that can index the contents of PDF files, so that we
> > > > > > > can use that for a search algorithm. We use the Apache PDFBox library for
> > > > > > > extracting text from a PDF, like this (where inputStream is a
> > > > > > > ByteArrayInputStream containing the contents of the PDF file):
> > > > > > >
> > > > > > > PDFTextStripper pdfStripper = new PDFTextStripper();
> > > > > > > pdDoc = PDDocument.load(inputStream,
> > > > > > > MemoryUsageSetting.setupTempFileOnly());
> > > > > > > String parsedText = pdfStripper.getText(pdDoc);
> > > > > >
> > > > > > You can pass the byte[] directly to load(). Also make sure that the
> > > > > > bytes are not altered in any way, e.g. through an incorrectly configured
> > > > > > web downloading, or an incorrectly configured resource loading
> > > > > > ("filtering" option must be false).
> > > > > >
> > > > > >
> > > > > > Also retry with 2.0.16 snapshot.
> > > > > >
> > > > > > Tilman
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
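One way to act on the quoted advice that the bytes must not be altered in any way is to fingerprint the PDF bytes at the source and again right before they are handed to PDFBox, for example once on the Windows machine and once inside the Linux container. The sketch below uses only the JDK; the class name PdfBytesCheck and its methods are hypothetical helpers, not part of PDFBox, and the check only shows whether the bytes differ, not why.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of PDFBox): fingerprints PDF bytes so that two
// copies of the same file can be compared across machines or pipeline stages.
final class PdfBytesCheck {

    // Returns the SHA-256 digest of the given bytes as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // True if both byte arrays produce the same SHA-256 fingerprint.
    static boolean sameContent(byte[] original, byte[] received) {
        return sha256Hex(original).equals(sha256Hex(received));
    }
}
```

Logging sha256Hex(bytes) on both systems and comparing the two strings is enough; if they differ, the file was changed somewhere in download or resource loading before PDFBox ever saw it.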

