Hi Dan,

neat idea - made a mental note :-)

That brings us back to the point that in complex setups you should not do the document pre-processing directly in Solr, but in a separate import process which can safely crash when processing a 4 GB PDF file.
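
A minimal sketch of such an import step, assuming the import process is written in Python and the actual extraction is some external command. The helper name run_isolated and the timeout value are hypothetical, not something from this thread. The point is only that a crash, an out-of-memory kill, or a hang in the child process comes back as None instead of taking the Solr JVM down with it:

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_isolated(cmd: Sequence[str], timeout_s: float = 600.0) -> Optional[str]:
    """Run an extraction command in a child process.

    Returns the child's stdout as text, or None if the child could not be
    started, timed out, or exited with a non-zero status (e.g. it crashed).
    """
    try:
        done = subprocess.run(
            list(cmd),
            capture_output=True,   # collect stdout/stderr instead of inheriting them
            timeout=timeout_s,     # the child is killed if it runs longer than this
            check=False,           # inspect the return code ourselves
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the executable could not be launched at all.
        return None
    if done.returncode != 0:
        # The converter failed or was killed; the caller can log and skip the file.
        return None
    return done.stdout.decode("utf-8", errors="replace")
```

The caller would then send only the returned text to Solr as an ordinary document and skip or log any file for which the helper returns None.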

Cheers,

Siegfried Goeschl

On 16.01.15 05:02, Dan Davis wrote:
Why re-write all the document conversion in Java ;)  Tika is very slow. A 5
GB PDF is very big.

If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
mode. The HTML mode captures some metadata that would otherwise be lost.
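
A rough sketch of what that could look like when driven from a script, assuming poppler's pdftotext is on the PATH and that its -htmlmeta and -enc options are what "HTML and UTF-8 output mode" refers to. The function names and the timeout are hypothetical:

```python
import subprocess
from pathlib import Path
from typing import List


def pdftotext_cmd(pdf: Path, out_html: Path) -> List[str]:
    """Build the pdftotext command line for HTML output with metadata, UTF-8 encoded."""
    return ["pdftotext", "-htmlmeta", "-enc", "UTF-8", str(pdf), str(out_html)]


def convert(pdf: Path, out_html: Path, timeout_s: float = 600.0) -> bool:
    """Run pdftotext; return True only if it exited cleanly and wrote the output file."""
    try:
        result = subprocess.run(
            pdftotext_cmd(pdf, out_html),
            timeout=timeout_s,  # give up on pathological files instead of hanging
            check=False,        # a non-zero exit is reported via the return value
        )
    except (OSError, subprocess.TimeoutExpired):
        # pdftotext is not installed, could not be started, or ran too long.
        return False
    return result.returncode == 0 and out_html.exists()
```

The resulting HTML file can then be posted to Solr as plain text content, so the PDF itself never has to be parsed inside the servlet container.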


If you need to go faster still, you can also write something linked
directly against the poppler library.

Before you jump down my throat about Tika being slow - I wrote a PDF
indexer that ran at 36 MB/s per core. Different indexer, all C, lots of
setjmp/longjmp. But fast...



On Thu, Jan 15, 2015 at 1:54 PM, <ganesh.ya...@sungard.com> wrote:

Siegfried and Michael, thank you for your replies and help.

-----Original Message-----
From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Sent: Thursday, January 15, 2015 3:45 AM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemoryError for PDF document upload into Solr

Hi Ganesh,

you can increase the heap size but parsing a 4 GB PDF document will very
likely consume A LOT OF memory - I think you need to check if that large
PDF can be parsed at all :-)

Cheers,

Siegfried Goeschl

On 14.01.15 18:04, Michael Della Bitta wrote:
Yep, you'll have to increase the heap size for your Tomcat container.

http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly

Michael Della Bitta


On Wed, Jan 14, 2015 at 12:00 PM, <ganesh.ya...@sungard.com> wrote:

Hello,

Can someone give me some hints for getting around the following error? Is
there a heap size parameter I can set in Tomcat, or in the Solr webapp
that gets deployed in Tomcat?

I am running the Solr webapp inside Tomcat on my local machine, which has
12 GB of RAM. I have a PDF document, up to 4 GB in size, that needs to be
loaded into Solr.




Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
          at java.util.AbstractCollection.toArray(Unknown Source)
          at java.util.ArrayList.<init>(Unknown Source)
          at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
          at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
          at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
          at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
          at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
          at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
          at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
          at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
          at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
          at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
          at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)


Thanks
Ganesh






