Re: Possible memory leak when extracting text?

Søren Pedersen Wed, 15 May 2019 00:08:56 -0700

I have been trying to add the 2.0.16-SNAPSHOT version as a dependency to my 
application, but I keep having issues. I added this to my pom file:


<repositories>
   <repository>
       <id>repository.apache.org.snapshots</id>
       <name>Apache snapshots repo</name>
       <url>https://repository.apache.org/content/groups/snapshots/</url>
       <snapshots>
           <enabled>true</enabled>
       </snapshots>
       <releases>
           <enabled>false</enabled>
       </releases>
   </repository>
</repositories>

And then I added this under dependencies:

<dependency>
   <groupId>org.apache.pdfbox</groupId>
   <artifactId>pdfbox</artifactId>
   <version>2.0.16-SNAPSHOT</version>
</dependency>

When I run “mvn compile” I get this error:

[ERROR] Failed to execute goal on project pdftextextractor: Could not resolve 
dependencies for project nu.optimise:pdftextextractor:jar:1.0: Failed to 
collect dependencies at org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: Failed to 
read artifact descriptor for org.apache.pdfbox:pdfbox:jar:2.0.16-SNAPSHOT: 
Could not find artifact 
org.apache.pdfbox:pdfbox-parent:pom:2.0.16-20190513.180308-43 in 
repository.apache.org.snapshots 
(https://repository.apache.org/content/groups/snapshots/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException


I am probably missing something obvious, but I haven’t been working with Java 
for that long, so I have no clue what to do (my googling skills did not 
prevail).

Do you have any tips?

Thanks a lot in advance!

Best regards,
Søren


On 11 May 2019, 11.04 +0200, Tilman Hausherr <[email protected]>, wrote:
> The reason I mentioned 2.0.16 is because of this bug:
> https://issues.apache.org/jira/browse/PDFBOX-4489
>
> That bug was triggered by a corrupt file. Yours isn't corrupt, but it could
> become so if it gets damaged in transfer or by filtering.
>
> 2.0.16 snapshot is here:
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.16-SNAPSHOT/
>
> Tilman
>
> Am 11.05.2019 um 06:54 schrieb Søren Pedersen:
> > Ok, that is very interesting. Thanks a lot for looking into this!
> >
> > I am a bit baffled as to why we experience the memory leak then, but I 
> > guess I will have to dig more into it.
> >
> > Best regards,
> > Søren
> > On 10 May 2019, 18.30 +0200, Andreas Lehmkuehler <[email protected]>, wrote:
> > > Am 10.05.19 um 15:52 schrieb Søren Pedersen:
> > > > I have done some more testing, and I found that when I run on Windows 
> > > > there are no problems, but when I run on Linux I get the memory leak. 
> > > > Tilman, would you be able to run the same test on a Linux box? - or 
> > > > maybe using a Linux Docker container, like I showed originally?
> > > I've extracted the text on Linux (Fedora 30, OpenJDK 1.8.0_212) without
> > > any problems using
> > >
> > > java -Xmx9m -jar pdfbox-app-2.0.15.jar ExtractText
> > >
> > > where -Xmx9m is the smallest working value
> > >
> > > Andreas
> > >
> > > > We would prefer to run our app on Linux, but this looks like a blocker 
> > > > for that unfortunately :(
> > > >
> > > > Best regards,
> > > > Søren Pedersen
> > > > On 10 May 2019, 09.32 +0200, Søren Pedersen <[email protected]>, wrote:
> > > > > Ok, thanks a lot for looking into this Tilman. I will try your 
> > > > > suggestion and keep fiddling with it :)
> > > > >
> > > > > Have a great weekend!
> > > > > On 10 May 2019, 08.12 +0200, Tilman Hausherr <[email protected]>, wrote:
> > > > > > Am 10.05.2019 um 07:22 schrieb Søren Pedersen:
> > > > > > > We have an application that can index the contents of PDF files,
> > > > > > > so that we can use that for a search algorithm. We use the Apache
> > > > > > > PDFBox library for extracting text from a PDF, like this (where
> > > > > > > inputStream is a ByteArrayInputStream containing the contents of
> > > > > > > the PDF file):
> > > > > > >
> > > > > > > PDFTextStripper pdfStripper = new PDFTextStripper();
> > > > > > > pdDoc = PDDocument.load(inputStream,
> > > > > > > MemoryUsageSetting.setupTempFileOnly());
> > > > > > > String parsedText = pdfStripper.getText(pdDoc);
> > > > > >
> > > > > > You can pass the byte[] directly to load(). Also make sure that the
> > > > > > bytes are not altered in any way, e.g. through an incorrectly
> > > > > > configured web download or incorrectly configured resource loading
> > > > > > (the "filtering" option must be false).
> > > > > >
> > > > > >
> > > > > > Also retry with 2.0.16 snapshot.
> > > > > >
> > > > > > Tilman
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > >
>
>
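
Regarding Tilman's advice above that the PDF bytes must not be altered on the
way to load(): one way to check this is to compare a checksum of the bytes the
application receives with a checksum of the original file. The following is a
minimal sketch using only the JDK, not PDFBox. The class and method names
(PdfBytesCheck, sha256Hex, startsWithPdfHeader) are made up for illustration,
and the header check is only a quick sanity test, not a validity check.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
final class PdfBytesCheck {

    /** Returns the SHA-256 digest of the data as a lowercase hex string. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the data begins with the ASCII marker "%PDF-". */
    static boolean startsWithPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If the hex digest of the bytes handed to PDDocument.load() differs from the
digest of the file on disk (for example as printed by sha256sum), the bytes were
modified somewhere in between, for instance by decoding them as text and
re-encoding them.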

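To tell a real leak from a heap that is merely sized differently on Windows and
Linux, it can help to log heap usage across repeated runs of the same
extraction, in the spirit of Andreas's -Xmx9m test above. A rough sketch, JDK
only: HeapProbe and its methods are hypothetical names, and System.gc() is only
a hint to the JVM, so the numbers are approximate.

```java
// Hypothetical helper for observing heap usage over repeated runs of a task.
final class HeapProbe {

    /** Heap currently in use by the JVM, in bytes (approximate). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Upper bound the JVM will try to use for the heap (cf. -Xmx), in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /**
     * Runs the task the given number of times and records the used heap after
     * each run, following a garbage collection request.
     */
    static long[] sample(Runnable task, int iterations) {
        long[] used = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            task.run();
            System.gc(); // a request only; the JVM may ignore it
            used[i] = usedHeapBytes();
        }
        return used;
    }
}
```

Passing the text extraction as the Runnable and printing the returned values on
both platforms shows whether usage keeps climbing from run to run, which would
point to retained objects, or levels off, which would point to heap sizing.
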