Solr also has a feature to stream from a local file rather than over the network. The parameter stream.file=/full/local/file/name.txt tells Solr to read the file from its local disk instead of from the POST upload. Of course, you have to get the entire file onto the Solr indexer machine (or onto a common file server) first.

http://wiki.apache.org/solr/UpdateRichDocuments#Parameters
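A minimal sketch of what that can look like with curl (not from the thread; it assumes remote streaming has been enabled in solrconfig.xml via <requestParsers enableRemoteStreaming="true" .../>, and the file path and port are made up):

    # Solr reads the file from its own disk; nothing large crosses the wire.
    curl 'http://localhost:8983/solr/update?stream.file=/full/local/file/name.xml&stream.contentType=text/xml;charset=utf-8'

    # Commit separately once the stream has been consumed.
    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'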
On Thu, Apr 1, 2010 at 9:27 PM, Mark Fletcher <mark.fletcher2...@gmail.com> wrote:
> Hi Erick, Shawn,
>
> Thank you for your replies.
>
> Luckily, on just the second attempt my 13GB Solr XML (more than a million
> docs) went into Solr without any problem, and I uploaded another two sets
> of 1.2 million+ docs without any hassle.
>
> I will try smaller XML files next time, as well as the autocommit
> suggestion.
>
> Best Rgds,
> Mark.
>
> On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith <sh...@thena.net> wrote:
>
>> The error might be that your HTTP client doesn't handle really large
>> files (a 32-bit overflow in the Content-Length header?), or that
>> something in your network is killing your long-lived socket. Solr can
>> definitely accept a 13GB XML document.
>>
>> I've uploaded large files into Solr successfully, including recently a
>> 12GB XML input file with ~4 million documents. My Solr instance had
>> 2GB of memory and it took about 2 hours. Solr streamed the XML in
>> nicely. I had to jump through a couple of hoops, but in my case it
>> was easier than writing a tool to split up my 12GB XML file...
>>
>> 1. I tried to use curl to do the upload, but it didn't handle files
>> that large. For my quick and dirty testing, netcat (nc) did the
>> trick--it doesn't buffer the file in memory and it doesn't overflow
>> the Content-Length header. Plus I could pipe the data through pv to
>> get a progress bar and an estimated time of completion. Not
>> recommended for production!
>>
>> FILE=documents.xml
>> SIZE=$(stat --format %s $FILE)
>> (echo "POST /solr/update HTTP/1.1
>> Host: localhost:8983
>> Content-Type: text/xml
>> Content-Length: $SIZE
>> " ; cat $FILE ) | pv -s $SIZE | nc localhost 8983
>>
>> 2. Indexing seemed to use less memory if I configured Solr to auto
>> commit periodically in solrconfig.xml. This is what I used:
>>
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <autoCommit>
>>     <!-- maximum uncommitted docs before autocommit is triggered -->
>>     <maxDocs>25000</maxDocs>
>>     <!-- maximum time (in ms) after adding a doc before autocommit
>>          is triggered; 300000 ms = 5 minutes -->
>>     <maxTime>300000</maxTime>
>>   </autoCommit>
>> </updateHandler>
>>
>> Shawn
>>
>> On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > Don't do that. For many reasons <G>. By trying to batch so many docs
>> > together, you're just *asking* for trouble. Quite apart from whether
>> > it'll work once, getting *any* HTTP-based protocol to work reliably
>> > with 13GB is fragile...
>> >
>> > For instance, I don't want to have to know whether the XML parsing in
>> > Solr reads the entire document into memory before processing or
>> > not. But I sure don't want my application to change behavior if Solr
>> > changes its mind and processes it the other way. My perfectly
>> > working application (assuming an event-driven parser) could
>> > suddenly start requiring over 13GB of memory... Oh my aching head!
>> >
>> > Your specific error might even depend on GC timing, which will
>> > cause it to break differently, sometimes, maybe...
>> >
>> > So do break things up and transmit multiple documents [one way to
>> > batch the upload is sketched at the end of this thread]. It'll save
>> > you a world of hurt.
>> >
>> > HTH
>> > Erick
>> >
>> > On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
>> > <mark.fletcher2...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> For the first time I tried uploading a huge input Solr XML containing
>> >> about 1.2 million *docs* (13GB in size).
>> >> After some time I get the following exception:
>> >>
>> >> The server encountered an internal error ([was class
>> >> java.net.SocketTimeoutException] Read timed out):
>> >>
>> >> java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out
>> >>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>> >>     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>> >>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>> >>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>> >>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
>> >>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
>> >>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>> >>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>> >>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>> >>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>> >>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>> >>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> >>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> >>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> >>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> >>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> >>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> >>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> >>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> >>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>> >>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>> >>     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>> >>     at java.lang.Thread.run(Thread.java:619)
>> >> Caused by: java.net.SocketTimeoutException: Read timed out
>> >> ...
>> >>
>> >> Was the file I tried to upload too big, and should I try reducing its
>> >> size?
>> >>
>> >> Thanks and Rgds,
>> >> Mark.
>> >>
>> >
>>

--
Lance Norskog
goks...@gmail.com
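Erick's advice above (split the file and send several smaller updates) could be scripted along these lines. A sketch only, not from the thread: the input layout (each plain <doc>...</doc> element on its own line inside a single <add> root), the file names, the batch size, and the URL are all assumptions.

    #!/bin/sh
    # Sketch: chunk a huge Solr XML file and POST it in batches.
    # Assumes big-input.xml keeps each <doc>...</doc> on one line.
    BATCH=25000

    # Pull out the per-document lines and chunk them into batch-aa, batch-ab, ...
    grep '<doc>' big-input.xml | split -l $BATCH - batch-

    # Wrap each chunk in its own <add> element and POST it.
    for f in batch-*; do
      { echo '<add>'; cat "$f"; echo '</add>'; } |
        curl -s 'http://localhost:8983/solr/update' \
             -H 'Content-Type: text/xml' --data-binary @-
    done

    # Commit once at the end (or rely on autocommit, as Shawn suggests).
    curl -s 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml' --data-binary '<commit/>'

A real tool would split on element boundaries with a streaming XML parser; the pipeline above is only reasonable for generated bulk files that are known to keep one document per line.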