I don't think I want to stream from Java; text munging in Java is a PITA. I would rather stream from a script, so I need a more general solution.
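For reference, one way to stream from a script without a Java client is to generate a single large `<add>` body and feed it to Solr's XML update handler as one streamed POST, rather than one request per document. A minimal sketch in Python follows; the URL, port, and field names are illustrative assumptions, not details taken from this thread:

```python
# Sketch: build a streamable <add> body for Solr's XML update handler.
# The update URL and field names below are assumptions for illustration.
from xml.sax.saxutils import escape

def doc_xml(fields):
    """Render one document (a dict of field name -> value) as a <doc> element."""
    parts = ["<doc>"]
    for name, value in fields.items():
        parts.append('<field name="%s">%s</field>' % (escape(name), escape(str(value))))
    parts.append("</doc>")
    return "".join(parts)

def add_body(docs):
    """Generator yielding chunks of one <add> request covering all docs,
    so the full body never has to be held in memory at once."""
    yield "<add>"
    for d in docs:
        yield doc_xml(d)
    yield "</add>"

# To actually stream this to a running Solr instance, a generator can be
# passed as the request body (chunked transfer encoding), e.g. with the
# third-party requests library:
#
#   import requests
#   requests.post("http://localhost:8983/solr/update",
#                 data=(chunk.encode("utf-8") for chunk in add_body(docs)),
#                 headers={"Content-Type": "text/xml"})
```

The generator keeps memory flat even for a 100,000-page report, since each `<doc>` is rendered and sent as it is produced.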
The streaming document interface looks interesting; let me see if I can figure out how to achieve the same thing without a Java client.

Brian

-----Original Message-----
From: Paolo Castagna [mailto:castagna.li...@googlemail.com]
Sent: Wednesday, April 07, 2010 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: solr best practice to submit many documents

Hi Brian,

I had similar questions when I began to try and evaluate Solr. If you use Java and SolrJ you might find these useful:

- http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
- http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

I am also interested in knowing the best and most efficient way to index a large number of documents.

Paolo

Wawok, Brian wrote:
> Hello,
>
> I am using Solr for some proof-of-concept work, and was wondering if anyone has guidance on a best practice.
>
> Background:
> Nightly we get a delivery of a few thousand reports. Each report is between 1 and 500,000 pages.
> For my proof of concept I am using a single 100,000-page report.
> I want to see how fast I can make Solr handle this single report, and then see how we can scale out to meet the total indexing demand (if needed).
>
> Trial 1:
>
> 1) Set up a Solr server on server A with the default settings. Added a few new fields to index, including a full-text index of the report.
>
> 2) Set up a simple Python script on server B. It splits the report into 100,000 small documents, pulls out a few key fields to be sent along for indexing, and uses a Python implementation of curl to shove the documents into the server (with 4 threads posting away).
>
> 3) After all 100,000 documents are posted, we post a commit and let the server index.
>
> I was able to get this method to work, and it took around 340 seconds for the posting and 10 seconds for the indexing.
> I am not sure if that indexing speed is a red herring and it was really doing a little bit of the indexing during the posts, or what.
>
> Regardless, it seems less than ideal to make 100,000 requests to the server to index 100,000 documents. Does anyone have an idea of how to make this process more efficient? Should I look into making a single XML document with 100,000 documents enclosed? Or what will give me the best performance? Will this be much better than what I am seeing with my post method? I am not against writing a custom parser on the Solr side, but if there is already a way in Solr to send many documents efficiently, that is better.
>
> Thanks!
>
> Brian Wawok
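The batching question raised above, one request per document versus one XML document with many documents enclosed, has a common middle ground: group documents into fixed-size batches, send each batch as a single `<add>` request, and issue one commit at the end. A small sketch of the grouping logic; the batch size of 1000 is an illustrative assumption, not a recommendation from this thread:

```python
# Sketch: turn 100,000 per-document requests into ~100 batched requests.
# Only the batching helper is shown; the posting step is commented out
# because it assumes a running Solr instance.
def batches(items, size):
    """Yield lists of at most `size` items, preserving order."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# for batch in batches(docs, 1000):
#     POST one <add> body containing every document in `batch`
# then send a single <commit/> after all batches have been posted
```

Fewer, larger requests amortize the per-request HTTP and parsing overhead, which is the likely source of most of the 340 seconds reported above.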