I've seen the best indexing throughput when sending batches of documents
rather than one document per request. You might try queueing on your
indexing machines for a bit, then sending off a batch every N documents.
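
A minimal sketch of that queue-then-batch idea, in Java since SolrJ is a Java client. BatchBuffer and its sink callback are hypothetical names used for illustration, not part of SolrJ; in a real indexer the sink would presumably hand each batch to the client's bulk add call instead of printing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers documents and hands them to a sink in batches. */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queue one document; send a batch once batchSize documents are pending. */
    synchronized void add(T doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Send whatever is pending, even if the batch is not full. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }
}

class BatchBufferDemo {
    public static void main(String[] args) {
        // The sink only prints here; this is where a bulk request would go.
        BatchBuffer<String> buffer = new BatchBuffer<>(3,
                batch -> System.out.println("sending " + batch.size() + " docs"));
        for (int i = 0; i < 7; i++) {
            buffer.add("doc-" + i);
        }
        buffer.flush(); // push the final partial batch
    }
}
```

Something like this would also want a timer-based flush so that a slow trickle of documents doesn't sit in the buffer indefinitely.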

Thanks,
Greg

On Feb 1, 2014, at 6:49 PM, Software Dev <static.void....@gmail.com> wrote:

> Also, if we are seeing a huge CPU spike on the leader when doing a bulk
> index, would changing any of these options help?
> 
> 
> On Sat, Feb 1, 2014 at 2:59 PM, Software Dev <static.void....@gmail.com> wrote:
> 
>> Our use case is that we have 3 indexing machines pulling off a Kafka queue,
>> and they are all sending individual updates.
>> 
>> 
>> On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 
>>> Just make sure parallelUpdates is set to true.
>>> 
>>> If you want to load even faster, you can use the bulk add methods, or if
>>> you need finer-grained responses, use the single add from multiple
>>> threads (though bulk add can also be done from multiple threads if you
>>> really want to try to push the max).
>>> 
>>> - Mark
>>> 
>>> http://about.me/markrmiller
>>> 
>>> On Jan 31, 2014, at 3:50 PM, Software Dev <static.void....@gmail.com>
>>> wrote:
>>> 
>>>> Which, if any, of these settings would be beneficial when bulk uploading?
>>>> 
>>>> 
>>>> On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> On Jan 31, 2014, at 1:56 PM, Greg Walters <greg.walt...@answers.com>
>>>>> wrote:
>>>>> 
>>>>>> I'm assuming you mean CloudSolrServer here. If I'm wrong, please ignore
>>>>>> my response.
>>>>>> 
>>>>>>> -updatesToLeaders
>>>>>> 
>>>>>> Only send documents to shard leaders while indexing. This saves
>>>>>> cross-talk between slaves and leaders, which results in more efficient
>>>>>> document routing.
>>>>> 
>>>>> Right, but recently this has less of an effect because CloudSolrServer
>>>>> can now hash documents and send them directly to the right place. This
>>>>> option has become more historical. Just make sure you set the correct id
>>>>> field on the CloudSolrServer instance for this hashing to work (I think
>>>>> it defaults to "id").
>>>>> 
>>>>>> 
>>>>>>> shutdownLBHttpSolrServer
>>>>>> 
>>>>>> CloudSolrServer uses an LBHttpSolrServer behind the scenes to distribute
>>>>>> requests (that aren't updates directly to leaders). Where did you find
>>>>>> this? I don't see this in the javadoc anywhere, but it is a boolean in
>>>>>> the CloudSolrServer class. It looks like when you create a new
>>>>>> CloudSolrServer and pass it your own LBHttpSolrServer, the boolean gets
>>>>>> set to false and the CloudSolrServer won't shut down the
>>>>>> LBHttpSolrServer when it gets shut down.
>>>>>> 
>>>>>>> parallelUpdates
>>>>>> 
>>>>>> The javadocs don't have any description for this one, but I checked out
>>>>>> the code for CloudSolrServer, and if parallelUpdates is set it looks
>>>>>> like it executes updates against multiple shards at the same time.
>>>>> 
>>>>> Right, we should definitely add some javadoc, but this sends updates to
>>>>> shards in parallel rather than with a single thread. It can really
>>>>> increase update speed. Still not as powerful as using CloudSolrServer
>>>>> from multiple threads, but a nice improvement nonetheless.
>>>>> 
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> http://about.me/markrmiller
>>>>> 
>>>>>> 
>>>>>> I'm no dev but I can read so please excuse any errors on my part.
>>>>>> 
>>>>>> Thanks,
>>>>>> Greg
>>>>>> 
>>>>>> On Jan 31, 2014, at 11:40 AM, Software Dev <static.void....@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Can someone clarify what the following options are:
>>>>>>> 
>>>>>>> - updatesToLeaders
>>>>>>> - shutdownLBHttpSolrServer
>>>>>>> - parallelUpdates
>>>>>>> 
>>>>>>> Also, I remember that in older versions of Solr there was an
>>>>>>> efficient, more compact format used between SolrJ and Solr. Does this
>>>>>>> still exist in the latest version of Solr? If so, is it the default?
>>>>>>> 
>>>>>>> Thanks
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
