I don't think I want to stream from Java, text munging in Java is a PITA. Would 
rather stream from a script, so need a more general solution.

The Streaming document interface looks interesting, let me see if I can figure 
out how to achieve the same thing without a Java client...


Brian

-----Original Message-----
From: Paolo Castagna [mailto:castagna.li...@googlemail.com] 
Sent: Wednesday, April 07, 2010 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: solr best practice to submit many documents

Hi Brian,
I had similar questions when I began to try and evaluate Solr.

If you use Java and SolrJ you might find these useful:

  - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
  - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

I am also interested in knowing the best and most efficient way
to index a large number of documents.

Paolo

Wawok, Brian wrote:
> Hello,
> 
> I am using SOLR for some proof of concept work, and was wondering if anyone 
> has some guidance on a best practice.
> 
> Background:
> Nightly we get a delivery of a few thousand reports. Each report is between 
> 1 and 500,000 pages.
> For my proof of concept I am using a single 100,000 page report.
> I want to see how fast I can make SOLR handle this single report, and then 
> can see how we can scale out to meet the total indexing demand (if needed).
> 
> Trial 1:
> 
> 1)      Set up a SOLR server on server A with the default settings. Added a 
> few new fields to index, including a full text index of the report.
> 
> 2)      Set up a simple Python script on server B. It splits the report into 
> 100,000 small documents, pulls out a few key fields to be sent along to 
> index, and uses a python implementation of curl to shove the documents into 
> the server (with 4 threads posting away).
> 
> 3)      After all 100,000 documents are posted, we send a commit and let the 
> server index.
> 
> 
> I was able to get this method to work, and it took around 340 seconds for the 
> posting, and 10 seconds for the indexing. I am not sure if that indexing 
> speed is a red herring, and it was really doing a little bit of the indexing 
> during the posts, or what.
> 
> Regardless, it seems less than ideal to make 100,000 requests to the server 
> to index 100,000 documents.  Does anyone have an idea for how to make this 
> process more efficient? Should I look into making an XML document with 
> 100,000 documents enclosed? Or what will give me the best performance?  Will 
> this be much better than what I am seeing with my post method?  I am not 
> against writing a custom parser on the SOLR side, but if there is already a 
> way in SOLR to send many documents efficiently, that is better.
> 
> 
> Thanks!
> 
> Brian Wawok
> 
> 
