The error might be that your HTTP client doesn't handle really large
files (a 32-bit overflow in the Content-Length header?), or something
in your network is killing your long-lived socket.  Solr itself can
definitely accept a 13GB XML document.
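As a quick sanity check on the 32-bit theory: 13GB doesn't fit in a
signed 32-bit integer, so a client that stores the length in an int
will truncate it.  Rough numbers (illustrative only, straight from
bash arithmetic):

  echo $(( 13 * 1024 * 1024 * 1024 ))                 # 13958643712 bytes in a 13GB file
  echo $(( 2**31 - 1 ))                               # 2147483647, largest signed 32-bit value
  echo $(( (13 * 1024 * 1024 * 1024) & 0xFFFFFFFF ))  # 1073741824, what a 32-bit field would keep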

I've uploaded large files into Solr successfully, including recently a
12GB XML input file with ~4 million documents.  My Solr instance had
2GB of memory and it took about 2 hours.  Solr streamed the XML in
nicely.  I had to jump through a couple of hoops, but in my case it
was easier than writing a tool to split up my 12GB XML file...
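If you do decide to split the file up instead (which is what Erick
recommends below), an awk sketch like the following might be a
starting point.  It assumes the input keeps the XML prolog, the outer
<add>/</add> wrapper, and each closing </doc> on lines of their own,
which may or may not be true for your file -- treat it as a rough
sketch, not a tested tool:

  # Writes chunk-0000.xml, chunk-0001.xml, ... each wrapping at most
  # 25000 <doc> elements in its own <add>...</add> envelope.
  awk -v max=25000 '
    /^<\?xml/ || /^<add/ || /^<\/add>/ { next }  # skip prolog and outer wrapper
    out == "" {                                  # open a new chunk
        out = sprintf("chunk-%04d.xml", chunk++)
        print "<add>" > out
    }
    { print > out }                              # copy the current line into the chunk
    /<\/doc>/ && ++docs >= max {                 # close the chunk after max docs
        print "</add>" > out
        close(out)
        out = ""; docs = 0
    }
    END { if (out != "") print "</add>" > out }
  ' documents.xml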

1. I tried to use curl to do the upload, but it didn't handle files
that large.  For my quick and dirty testing, netcat (nc) did the
trick--it doesn't buffer the file in memory and it doesn't overflow
the Content-Length header.  Plus I could pipe the data through pv to
get a progress bar and estimated time of completion.  Not recommended
for production!

  FILE=documents.xml
  SIZE=$(stat --format %s "$FILE")
  ( printf 'POST /solr/update HTTP/1.1\r\n'
    printf 'Host: localhost:8983\r\n'
    printf 'Content-Type: text/xml\r\n'
    printf 'Content-Length: %s\r\n\r\n' "$SIZE"
    cat "$FILE" ) | pv -s "$SIZE" | nc localhost 8983

2. Indexing seemed to use less memory if I configured Solr to auto
commit periodically in solrconfig.xml.  This is what I used:

    <updateHandler class="solr.DirectUpdateHandler2">
        <autoCommit>
            <!-- maximum uncommitted docs before an autocommit is triggered -->
            <maxDocs>25000</maxDocs>
            <!-- maximum time (in ms) after adding a doc before an
                 autocommit is triggered: 5 minutes -->
            <maxTime>300000</maxTime>
        </autoCommit>
    </updateHandler>
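
One more note: autoCommit only flushes periodically, so once the whole
upload finishes it's worth sending an explicit commit so the last batch
of documents becomes searchable (adjust the URL if your Solr isn't at
the default localhost:8983 used above):

  curl 'http://localhost:8983/solr/update' \
       -H 'Content-Type: text/xml' --data-binary '<commit/>'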

Shawn

On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> Don't do that. For many reasons <G>. By trying to batch so many docs
> together, you're just *asking* for trouble. Quite apart from whether it'll
> work once, having *any* HTTP-based protocol work reliably with 13G is
> fragile...
>
> For instance, I don't want to have to know whether the XML parsing in
> SOLR parses the entire document into memory before processing or
> not. But I sure don't want my application to change behavior if SOLR
> changes its mind and wants to process the other way. My perfectly
> working application (assuming an event-driven parser) could
> suddenly start requiring over 13G of memory... Oh my aching head!
>
> Your specific error might even be dependent upon GCing, which will
> cause it to break differently, sometimes, maybe......
>
> So do break things up and transmit multiple documents. It'll save you
> a world of hurt.
>
> HTH
> Erick
>
> On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
> <mark.fletcher2...@gmail.com> wrote:
>
>> Hi,
>>
>> For the first time I tried uploading a huge input SOLR xml having about 1.2
>> million *docs* (13GB in size). After some time I get the following
>> exception:-
>>
>> The server encountered an internal error ([was class java.net.SocketTimeoutException] Read timed out
>> java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out
>>  at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>>  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>>  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>>  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
>>  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
>>  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>>  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>>  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>>  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>>  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>>  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>>  at java.lang.Thread.run(Thread.java:619)
>> Caused by: java.net.SocketTimeoutException: Read timed out
>> ...
>>
>> Was the file I tried to upload too big, and should I try reducing its
>> size?
>>
>> Thanks and Rgds,
>> Mark.
>>
>
