Streaming XML input (or CSV, if you can make that happen) works fine. If the file is local, you can make the same curl request that would normally upload the file via POST, but pass this parameter instead: stream.file=/full/path/name.xml
Solr will then read the file from its local disk instead of receiving it over HTTP.
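For illustration, here is a minimal sketch of that request in Python 3 (standard library only). It assumes Solr is running at localhost:8983 with the standard /solr/update handler, that remote streaming is enabled in solrconfig.xml, and that /full/path/name.xml already contains a valid <add>...</add> update message; the host, path, and parameters are placeholders to adjust for your setup.

    import urllib.parse
    import urllib.request

    # Ask Solr to read the update file from its own local disk instead of
    # receiving the body over HTTP. Roughly equivalent to:
    #   curl 'http://localhost:8983/solr/update?stream.file=/full/path/name.xml&commit=true'
    params = urllib.parse.urlencode({
        "stream.file": "/full/path/name.xml",            # path as seen by the Solr server
        "stream.contentType": "text/xml;charset=utf-8",  # the file is an XML update message
        "commit": "true",                                # commit once the file is processed
    })
    with urllib.request.urlopen("http://localhost:8983/solr/update?" + params) as resp:
        print(resp.read().decode("utf-8"))               # Solr's status response

A second sketch, showing how the 100,000 per-page documents from the quoted thread could be batched into that single XML file in the first place, follows the quoted messages below.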
On Wed, Apr 7, 2010 at 9:18 AM, Wawok, Brian <brian.wa...@cmegroup.com> wrote:
> I don't think I want to stream from Java; text munging in Java is a PITA.
> I would rather stream from a script, so I need a more general solution.
>
> The streaming document interface looks interesting; let me see if I can
> figure out how to achieve the same thing without a Java client.
>
> Brian
>
> -----Original Message-----
> From: Paolo Castagna [mailto:castagna.li...@googlemail.com]
> Sent: Wednesday, April 07, 2010 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr best practice to submit many documents
>
> Hi Brian,
> I had similar questions when I began to try and evaluate Solr.
>
> If you use Java and SolrJ, you might find these useful:
>
> - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
> - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
>
> I am also interested in knowing the best and most efficient way
> to index a large number of documents.
>
> Paolo
>
> Wawok, Brian wrote:
>> Hello,
>>
>> I am using SOLR for some proof-of-concept work, and was wondering if anyone
>> has guidance on a best practice.
>>
>> Background:
>> Nightly we get a delivery of a few thousand reports. Each report is between
>> 1 and 500,000 pages. For my proof of concept I am using a single
>> 100,000-page report. I want to see how fast I can make SOLR handle this
>> single report, and then see how we can scale out to meet the total indexing
>> demand (if needed).
>>
>> Trial 1:
>>
>> 1) Set up a Solr server on server A with the default settings. Added a few
>> new fields to index, including a full-text index of the report.
>>
>> 2) Set up a simple Python script on server B. It splits the report into
>> 100,000 small documents, pulls out a few key fields to be sent along to
>> index, and uses a Python implementation of curl to shove the documents into
>> the server (with 4 threads posting away).
>>
>> 3) After all 100,000 documents are posted, we post a commit and let the
>> server index.
>>
>> I was able to get this method to work, and it took around 340 seconds for
>> the posting and 10 seconds for the indexing. I am not sure if that indexing
>> speed is a red herring and it was really doing a little bit of the indexing
>> during the posts, or what.
>>
>> Regardless, it seems less than ideal to make 100,000 requests to the server
>> to index 100,000 documents. Does anyone have an idea for how to make this
>> process more efficient? Should I look into making an XML document with
>> 100,000 documents enclosed? Or what will give me the best performance? Will
>> this be much better than what I am seeing with my post method? I am not
>> against writing a custom parser on the SOLR side, but if there is already a
>> way in SOLR to send many documents efficiently, that is better.
>>
>> Thanks!
>>
>> Brian Wawok

--
Lance Norskog
goks...@gmail.com
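To address the batching question in the quoted thread (100,000 separate POSTs versus one XML document with 100,000 documents enclosed), here is a minimal, hypothetical sketch that writes all pages of a report into a single <add> update file, which can then be indexed with the single stream.file request shown at the top. The field names (id, page, text) are placeholders for whatever the real schema defines.

    import xml.etree.ElementTree as ET

    def write_update_file(pages, path):
        """Write one <add> message containing one <doc> per report page."""
        add = ET.Element("add")
        for i, page_text in enumerate(pages):
            doc = ET.SubElement(add, "doc")
            ET.SubElement(doc, "field", name="id").text = "report-1-page-%d" % i
            ET.SubElement(doc, "field", name="page").text = str(i)
            ET.SubElement(doc, "field", name="text").text = page_text
        ET.ElementTree(add).write(path, encoding="utf-8", xml_declaration=True)

    # Example: 100,000 dummy pages -> one file on the Solr host, one indexing request.
    write_update_file(("page body %d" % n for n in range(100000)), "/full/path/name.xml")

With the whole report in one file that Solr reads locally, the client makes a single HTTP request instead of 100,000, which is likely where most of the 340 seconds of posting time went.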