Solr can also read the update from a local file rather than receiving
it over the network. The parameter
    stream.file=/full/local/file/name.txt
means 'read this file from the local disk instead of the POST upload'.
Of course, you have to get the entire file onto the Solr indexer
machine (or a common file server).

http://wiki.apache.org/solr/UpdateRichDocuments#Parameters
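
For illustration, here is a minimal Python sketch of building such a
request. It is not taken from the wiki page: the helper name
stream_file_url, the base URL http://localhost:8983/solr and the
commit=true parameter are assumptions for the example. As far as I know,
Solr also has to be configured to allow this (enableRemoteStreaming="true"
on the requestParsers element in solrconfig.xml), so check your own setup.

```python
from urllib.parse import urlencode


def stream_file_url(base_url, local_path, commit=True):
    """Build an update URL asking Solr to read local_path from its own disk.

    base_url and commit are illustrative assumptions; only the stream.file
    parameter itself comes from the message above.
    """
    params = {"stream.file": local_path}
    if commit:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update?" + urlencode(params)


# Usage (needs a running Solr, so it is left as a comment):
#   import urllib.request
#   url = stream_file_url("http://localhost:8983/solr",
#                         "/full/local/file/name.txt")
#   with urllib.request.urlopen(url) as resp:
#       print(resp.status)
```

Note that the path is resolved on the Solr machine, not on the client
that sends the request.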

On Thu, Apr 1, 2010 at 9:27 PM, Mark Fletcher
<mark.fletcher2...@gmail.com> wrote:
> Hi Eric, Shawn,
>
> Thank you for your reply.
>
> Luckily, on the second attempt my 13GB SOLR XML (more than a million
> docs) went into SOLR without any problem, and I uploaded 2 more sets
> of 1.2 million+ docs without any hassle.
>
> Next time I will try a larger number of smaller XML files, as well as
> the auto commit suggestion.
>
> Best Rgds,
> Mark.
>
> On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith <sh...@thena.net> wrote:
>
>> The error might be that your http client doesn't handle really large
>> files (32-bit overflow in the Content-Length header?) or something in
>> your network is killing your long-lived socket?  Solr can definitely
>> accept a 13GB xml document.
>>
>> I've uploaded large files into Solr successfully, including recently a
>> 12GB XML input file with ~4 million documents.  My Solr instance had
>> 2GB of memory and it took about 2 hours.  Solr streamed the XML in
>> nicely.  I had to jump through a couple of hoops, but in my case it
>> was easier than writing a tool to split up my 12GB XML file...
>>
>> 1. I tried to use curl to do the upload, but it didn't handle files
>> that large.  For my quick and dirty testing, netcat (nc) did the
>> trick--it doesn't buffer the file in memory and it doesn't overflow
>> the Content-Length header.  Plus I could pipe the data through pv to
>> get a progress bar and estimated time of completion.  Not recommended
>> for production!
>>
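>>  FILE=documents.xml
>>  SIZE=$(stat --format %s "$FILE")   # file size in bytes (GNU stat)
>>  # Send the request headers, then the file body, straight to the socket.
>>  (printf 'POST /solr/update HTTP/1.1\r\n'
>>   printf 'Host: localhost:8983\r\n'
>>   printf 'Content-Type: text/xml\r\n'
>>   printf 'Content-Length: %s\r\n\r\n' "$SIZE"
>>   cat "$FILE") | pv -s "$SIZE" | nc localhost 8983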
>>
>> 2. Indexing seemed to use less memory if I configured Solr to auto
>> commit periodically in solrconfig.xml.  This is what I used:
>>
>>    <updateHandler class="solr.DirectUpdateHandler2">
>>        <autoCommit>
>>            <!-- maximum uncommitted docs before autocommit triggered -->
>>            <maxDocs>25000</maxDocs>
>>            <!-- 5 minutes, maximum time (in ms) after adding a doc before an autocommit is triggered -->
>>            <maxTime>300000</maxTime>
>>        </autoCommit>
>>    </updateHandler>
>>
>> Shawn
>>
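If neither curl nor netcat fits, a streamed upload can also be sketched
in Python. This is an illustration under assumptions, not what Shawn ran:
post_file_streaming and upload are hypothetical helpers, the host, port
and path are copied from the netcat example above, and the timeout is a
guess. Given an open binary file and an explicit Content-Length,
http.client sends the body block by block instead of loading it into
memory, and Python integers do not overflow at 2GB or 4GB.

```python
import http.client
import os


def post_file_streaming(path, conn, url="/solr/update"):
    """POST the file at `path` over `conn` without loading it into memory."""
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        # Python ints are arbitrary precision, so a 13GB size cannot overflow.
        "Content-Length": str(os.path.getsize(path)),
    }
    with open(path, "rb") as body:
        # Given a file object, http.client reads and sends it in blocks.
        conn.request("POST", url, body=body, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()


def upload(path, host="localhost", port=8983, timeout=4 * 3600):
    """Open a connection with a generous (assumed) timeout and upload."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        return post_file_streaming(path, conn)
    finally:
        conn.close()


# Usage (needs a running Solr, so it is left as a comment):
#   status, reply = upload("documents.xml")
```

The timeout only covers the client side; a read timeout on the server or
on a proxy in between can still cut the connection.
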
>> On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> > Don't do that. For many reasons <G>. By trying to batch so many docs
>> > together, you're just *asking* for trouble. Quite apart from whether
>> > it'll work once, expecting *any* HTTP-based protocol to work reliably
>> > with 13G is fragile...
>> >
>> > For instance, I don't want to have to know whether the XML parsing in
>> > SOLR parses the entire document into memory before processing or
>> > not. But I sure don't want my application to change behavior if SOLR
>> > changes its mind and wants to process the other way. My perfectly
>> > working application (assuming an event-driven parser) could
>> > suddenly start requiring over 13G of memory... Oh my aching head!
>> >
>> > Your specific error might even be dependent upon GCing, which will
>> > cause it to break differently, sometimes, maybe......
>> >
>> > So do break things up and transmit multiple documents. It'll save you
>> > a world of hurt.
>> >
>> > HTH
>> > Erick
>> >
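Splitting the file does not need much tooling. As a rough sketch (not
something posted in this thread), the Python below walks a Solr <add>
file with an event-driven parser and yields the <doc> elements in
batches, each of which could be posted as its own <add> message. The
function names and the batch size of 25000 are assumptions for
illustration, and it assumes a single top-level <add> element.

```python
import xml.etree.ElementTree as ET


def iter_batches(source, batch_size=25000):
    """Yield lists of serialized <doc> elements, batch_size at a time.

    `source` is a filename or a binary file object holding <add>...</add>.
    """
    batch = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the enclosing <add>
    for event, elem in context:
        if event == "end" and elem.tag == "doc":
            elem.tail = None  # drop whitespace between docs
            batch.append(ET.tostring(elem, encoding="unicode"))
            root.clear()  # release processed docs so memory stays flat
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def as_add_message(batch):
    """Wrap one batch of serialized docs in its own <add> element."""
    return "<add>" + "".join(batch) + "</add>"


# Usage: post each message separately instead of one 13GB request.
#   for batch in iter_batches("documents.xml"):
#       message = as_add_message(batch)
```

Each batch is held in memory as strings, so the batch size bounds memory
use on the client regardless of the size of the input file.
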
>> > On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher
>> > <mark.fletcher2...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> For the first time I tried uploading a huge input SOLR xml having about
>> >> 1.2 million *docs* (13GB in size). After some time I get the following
>> >> exception:-
>> >>
>> >> The server encountered an internal error ([was class
>> >> java.net.SocketTimeoutException] Read timed out
>> >> java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out
>> >>  at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>> >>  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>> >>  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>> >>  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>> >>  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
>> >>  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
>> >>  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>> >>  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>> >>  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>> >>  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>> >>  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>> >>  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> >>  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> >>  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> >>  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> >>  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> >>  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> >>  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> >>  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> >>  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>> >>  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>> >>  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>> >>  at java.lang.Thread.run(Thread.java:619)
>> >> Caused by: java.net.SocketTimeoutException: Read timed out
>> >> ...
>> >>
>> >> Was the file I tried to upload too big and should I try reducing its
>> >> size..?
>> >>
>> >> Thanks and Rgds,
>> >> Mark.
>> >>
>> >
>>
>



-- 
Lance Norskog
goks...@gmail.com
