Dave,

You may want to break large docs into chunks, say by chapter or other logical segment.
This will help in:

- relevance ranking - the term frequency of large docs will cause uneven weighting unless the relevance calculation does log normalization
- finer granularity of retrieval - for example, a dictionary, a thesaurus, and an encyclopedia probably all have what you want, but how do you get at it quickly?
- post-processing - highlighting, for example, can be a performance killer, as the search/replace scans the entire large file for matching strings

(A rough sketch of one way to split and post per-chapter documents is appended below the quoted message.)

Jon

-----Original Message-----
From: David Thibault [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2008 7:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing very large files.

All,

A while back I was running into a Java heap out-of-memory error while indexing large files. I figured out that was my own fault, due to a misconfiguration of my Netbeans memory settings. However, now that is fixed, and I have stumbled upon a new error. When trying to upload files that include a Solr TextField value of 32 MB or more, I get the following error (uploading with SimplePostTool):

Solr returned an error: error reading input, returned 0
javax.xml.stream.XMLStreamException: error reading input, returned 0
        at com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3709)
        at com.bea.xml.stream.MXParser.more(MXParser.java:3715)
        at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1936)
        at com.bea.xml.stream.MXParser.next(MXParser.java:1333)
        at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:280)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:237)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:613)

I suspect there's a setting somewhere that I'm overlooking, but after peering through the solrconfig.xml and schema.xml files I am not seeing anything obvious (to me, anyway... =). The second line of the error shows it's crashing in MXParser.fillBuf, which implies that I'm overloading the parser's buffer (I assume because the string is too large).

Thanks in advance for any assistance,
Dave
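[Appended sketch of the chunking approach described above. This is a minimal illustration, not anything taken from this thread: it uses only the standard Java library, and the field names (id, title, text), the "Chapter " delimiter, and the output file naming are all assumptions that would need to match your own schema and documents.]

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

/**
 * Split one large plain-text document into per-chapter Solr add-documents,
 * each written as its own XML file so it can be posted individually.
 * Note: FileReader/FileWriter use the platform default encoding; a real
 * version should read and write an explicit charset such as UTF-8.
 */
public class ChunkAndWrite {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ChunkAndWrite <input.txt> <outputDir>");
            System.exit(1);
        }
        File input = new File(args[0]);
        File outDir = new File(args[1]);
        outDir.mkdirs();

        BufferedReader in = new BufferedReader(new FileReader(input));
        StringBuilder chunk = new StringBuilder();
        int chapter = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // Assumed delimiter: a line starting with "Chapter " begins a new chunk.
            if (line.startsWith("Chapter ") && chunk.length() > 0) {
                writeDoc(outDir, input.getName(), chapter++, chunk.toString());
                chunk.setLength(0);
            }
            chunk.append(line).append('\n');
        }
        if (chunk.length() > 0) {
            writeDoc(outDir, input.getName(), chapter, chunk.toString());
        }
        in.close();
    }

    // Write one <add><doc> file; each chunk becomes its own Solr document.
    private static void writeDoc(File dir, String name, int chapter, String text)
            throws IOException {
        File out = new File(dir, name + "." + chapter + ".xml");
        PrintWriter w = new PrintWriter(new FileWriter(out));
        w.println("<add>");
        w.println("  <doc>");
        w.println("    <field name=\"id\">" + escape(name + "-" + chapter) + "</field>");
        w.println("    <field name=\"title\">" + escape(name + " chapter " + chapter) + "</field>");
        w.println("    <field name=\"text\">" + escape(text) + "</field>");
        w.println("  </doc>");
        w.println("</add>");
        w.close();
    }

    // Escape the characters that would otherwise break the XML update message.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

Each generated file can then be sent on its own, e.g. with SimplePostTool (java -jar post.jar chunks/*.xml), so every update message stays far smaller than the 32 MB field value that is tripping up the XML parser in the trace above.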