The error might be that your HTTP client doesn't handle really large files (a 32-bit overflow in the Content-Length header?), or that something in your network is killing your long-lived socket. Solr itself can definitely accept a 13GB XML document.
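Just to illustrate the overflow idea (made-up numbers, not anything from your logs): a client that stashes the file size in a signed 32-bit int truncates it silently:

    long size = 13L * 1024 * 1024 * 1024; // 13,958,643,712 bytes
    int truncated = (int) size;           // 1073741824: the header would claim ~1GB

A server that trusted a header like that would stop reading the body roughly 12GB early.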
I've uploaded large files into Solr successfully, including recently a 12GB XML input file with ~4 million documents. My Solr instance had 2GB of memory and it took about 2 hours. Solr streamed the XML in nicely. I had to jump through a couple of hoops, but in my case it was easier than writing a tool to split up my 12GB XML file...

1. I tried to use curl to do the upload, but it didn't handle files that large. For my quick and dirty testing, netcat (nc) did the trick: it doesn't buffer the file in memory and it doesn't overflow the Content-Length header. Plus I could pipe the data through pv to get a progress bar and an estimated time of completion. Not recommended for production!

FILE=documents.xml
SIZE=$(stat --format %s $FILE)

(echo "POST /solr/update HTTP/1.1
Host: localhost:8983
Content-Type: text/xml
Content-Length: $SIZE
" ; cat $FILE ) | pv -s $SIZE | nc localhost 8983

2. Indexing seemed to use less memory if I configured Solr to auto-commit periodically in solrconfig.xml. This is what I used:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>25000</maxDocs>  <!-- maximum uncommitted docs before an autocommit is triggered -->
    <maxTime>300000</maxTime> <!-- maximum time (in ms) after adding a doc before an autocommit is triggered: 5 minutes -->
  </autoCommit>
</updateHandler>
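If you'd rather avoid raw nc, here's a rough, untested Java sketch of the same idea (the URL and file name are just the placeholders from the script above). It streams the file instead of buffering it, and because it uses chunked transfer encoding it never sends a Content-Length header at all, so there is no 32-bit field to overflow:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamPost {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml");
        // Chunked transfer encoding: no Content-Length header is sent,
        // so a 13GB body cannot overflow a 32-bit length field.
        conn.setChunkedStreamingMode(64 * 1024);
        InputStream in = new BufferedInputStream(new FileInputStream("documents.xml"));
        OutputStream out = conn.getOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n); // each chunk goes straight to the socket
        }
        out.close();
        in.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}

You lose the pv progress bar this way, but you could count bytes in the copy loop instead.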
Shawn

On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> Don't do that. For many reasons <G>. By trying to batch so many docs
> together, you're just *asking* for trouble. Quite apart from whether it'll
> work once, having *any* HTTP-based protocol work reliably with 13G is
> fragile...
>
> For instance, I don't want to have to know whether the XML parsing in
> SOLR reads the entire document into memory before processing or not.
> But I sure don't want my application to change behavior if SOLR changes
> its mind and wants to process the other way. My perfectly working
> application (assuming an event-driven parser) could suddenly start
> requiring over 13G of memory... Oh my aching head!
>
> Your specific error might even be dependent upon GCing, which will
> cause it to break differently, sometimes, maybe...
>
> So do break things up and transmit multiple documents. It'll save you
> a world of hurt.
>
> HTH
> Erick
>
> On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
> <mark.fletcher2...@gmail.com> wrote:
>
>> Hi,
>>
>> For the first time I tried uploading a huge input SOLR xml having about
>> 1.2 million *docs* (13GB in size). After some time I get the following
>> exception:
>>
>> The server encountered an internal error ([was class
>> java.net.SocketTimeoutException] Read timed out
>> java.lang.RuntimeException: [was class java.net.SocketTimeoutException]
>> Read timed out
>>   at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>>   at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>>   at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>>   at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>   at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
>>   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
>>   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>>   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>>   at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>>   at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>>   at java.lang.Thread.run(Thread.java:619)
>> Caused by: java.net.SocketTimeoutException: Read timed out
>> ...
>>
>> Was the file I tried to upload too big, and should I try reducing its
>> size?
>>
>> Thanks and Rgds,
>> Mark.
>>