Streaming XML input (or CSV, if you can produce that) works fine. If
the file is local to the Solr server, you can make the same curl
request that would normally upload a file via POST, but instead pass
this parameter: stream.file=/full/path/name.xml

Solr will then read the file from the local filesystem instead of
receiving it over HTTP.
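
For example, something like this should work (just a sketch: the URL,
file path, and field names here are placeholders, and remote streaming
has to be enabled with enableRemoteStreaming="true" in solrconfig.xml):

  <!-- /full/path/name.xml on the Solr server: one <add> holding many docs -->
  <add>
    <doc>
      <field name="id">report-1-page-1</field>
      <field name="body">...page text...</field>
    </doc>
    <doc>
      <field name="id">report-1-page-2</field>
      <field name="body">...page text...</field>
    </doc>
    <!-- ...as many <doc> elements as needed... -->
  </add>

  # one request indexes the whole file and commits
  curl "http://localhost:8983/solr/update?stream.file=/full/path/name.xml&commit=true"

That gets all 100,000 documents in with a single request instead of
100,000 posts.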

On Wed, Apr 7, 2010 at 9:18 AM, Wawok, Brian <brian.wa...@cmegroup.com> wrote:
> I don't think I want to stream from Java; text munging in Java is a PITA.
> I would rather stream from a script, so I need a more general solution.
>
> The streaming document interface looks interesting; let me see if I can
> figure out how to achieve the same thing without a Java client.
>
>
> Brian
>
> -----Original Message-----
> From: Paolo Castagna [mailto:castagna.li...@googlemail.com]
> Sent: Wednesday, April 07, 2010 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr best practice to submit many documents
>
> Hi Brian,
> I had similar questions when I began to try and evaluate Solr.
>
> If you use Java and SolrJ you might find these useful:
>
>  - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
>  -
> http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
>
> I am also interested in knowing what is the best and most efficient way
> to index a large number of documents.
>
> Paolo
>
> Wawok, Brian wrote:
>> Hello,
>>
>> I am using SOLR for some proof of concept work, and was wondering if anyone 
>> has some guidance on a best practice.
>>
>> Background:
>> Nightly we get a delivery of a few thousand reports. Each report is between 1 and
>> 500,000 pages.
>> For my proof of concept I am using a single 100,000 page report.
>> I want to see how fast I can make SOLR handle this single report, and then 
>> see how we can scale out to meet the total indexing demand (if needed).
>>
>> Trial 1:
>>
>> 1)      Set up a Solr server on server A with the default settings. Added a
>> few new fields to index, including a full-text index of the report.
>>
>> 2)      Set up a simple Python script on server B. It splits the report into
>> 100,000 small documents, pulls out a few key fields to be sent along for
>> indexing, and uses a Python implementation of curl to shove the documents into
>> the server (with 4 threads posting away).
>>
>> 3)      After all 100,000 documents are posted, we send a commit and let the
>> server index.
>>
>>
>> I was able to get this method to work, and it took around 340 seconds for 
>> the posting, and 10 seconds for the indexing. I am not sure if that indexing 
>> speed is a red herring, and it was really doing a little bit of the indexing
>> during the posts, or what.
>>
>> Regardless, it seems less than ideal to make 100,000 requests to the server 
>> to index 100,000 documents.  Does anyone have an idea for how to make this 
>> process more efficient? Should I look into making an XML document with 
>> 100,000 documents enclosed? Or what will give me the best performance?  Will 
>> this be much better than what I am seeing with my post method?  I am not 
>> against writing a custom parser on the SOLR side, but if there is already a 
>> way in SOLR to send many documents efficiently, that is better.
>>
>>
>> Thanks!
>>
>> Brian Wawok
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com
