I've seen the best throughput while indexing by sending batches of documents rather than one document per request. You might try queueing on your indexing machines for a bit, then sending off a batch every N documents.
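A rough sketch of that queue-then-send idea, in case it helps. The DocumentBatcher class and its sender callback are hypothetical names made up for illustration, not part of SolrJ; the callback is just a stand-in for whichever bulk add call your indexer makes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Buffers documents and hands them to a sender in batches of at most batchSize.
 * Hypothetical helper: the sender is a placeholder for the indexer's bulk add call.
 */
class DocumentBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender;
    private final List<T> buffer;

    DocumentBatcher(int batchSize, Consumer<List<T>> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sender = sender;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Queue one document; send the whole buffer once it holds batchSize documents. */
    synchronized void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Send whatever is queued, e.g. on shutdown or after an idle period. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Hand the sender its own copy so clearing the buffer cannot affect it.
        sender.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Each consumer pulling off the queue would call add() per message and flush() before shutting down, so a partial batch is never left behind. N is something to tune against your own document sizes.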

Thanks,
Greg

On Feb 1, 2014, at 6:49 PM, Software Dev <static.void....@gmail.com> wrote:

> Also, if we are seeing a huge CPU spike on the leader when doing a bulk
> index, would changing any of the options help?
>
> On Sat, Feb 1, 2014 at 2:59 PM, Software Dev <static.void....@gmail.com> wrote:
>
>> Our use case is we have 3 indexing machines pulling off a Kafka queue and
>> they are all sending individual updates.
>>
>> On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Just make sure parallel updates is set to true.
>>>
>>> If you want to load even faster, you can use the bulk add methods, or if
>>> you need more fine-grained responses, use the single add from multiple
>>> threads (though bulk add can also be done via multiple threads if you
>>> really want to try and push the max).
>>>
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>> On Jan 31, 2014, at 3:50 PM, Software Dev <static.void....@gmail.com> wrote:
>>>
>>>> Which, if any, of these settings would be beneficial when bulk uploading?
>>>>
>>>> On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> On Jan 31, 2014, at 1:56 PM, Greg Walters <greg.walt...@answers.com> wrote:
>>>>>
>>>>>> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore
>>>>>> my response.
>>>>>>
>>>>>>> - updatesToLeaders
>>>>>>
>>>>>> Only send documents to shard leaders while indexing. This saves
>>>>>> cross-talk between slaves and leaders which results in more efficient
>>>>>> document routing.
>>>>>
>>>>> Right, but recently this has less of an effect because CloudSolrServer
>>>>> can now hash documents and directly send them to the right place. This
>>>>> option has become more historical. Just make sure you set the correct id
>>>>> field on the CloudSolrServer instance for this hashing to work (I think
>>>>> it defaults to "id").
>>>>>
>>>>>>> - shutdownLBHttpSolrServer
>>>>>>
>>>>>> CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute
>>>>>> requests (that aren't updates directly to leaders). Where did you find
>>>>>> this? I don't see this in the javadoc anywhere but it is a boolean in
>>>>>> the CloudSolrServer class. It looks like when you create a new
>>>>>> CloudSolrServer and pass it your own LBHttpSolrServer the boolean gets
>>>>>> set to false and the CloudSolrServer won't shut down the
>>>>>> LBHttpSolrServer when it gets shut down.
>>>>>>
>>>>>>> - parallelUpdates
>>>>>>
>>>>>> The javadocs don't have any description for this one but I checked out
>>>>>> the code for CloudSolrServer and if parallelUpdates is set it looks
>>>>>> like it executes update statements to multiple shards at the same time.
>>>>>
>>>>> Right, we should def add some javadoc, but this sends updates to shards
>>>>> in parallel rather than with a single thread. Can really increase update
>>>>> speed. Still not as powerful as using CloudSolrServer from multiple
>>>>> threads, but a nice improvement nonetheless.
>>>>>
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>>
>>>>>> I'm no dev but I can read so please excuse any errors on my part.
>>>>>>
>>>>>> Thanks,
>>>>>> Greg
>>>>>>
>>>>>> On Jan 31, 2014, at 11:40 AM, Software Dev <static.void....@gmail.com> wrote:
>>>>>>
>>>>>>> Can someone clarify what the following options are:
>>>>>>>
>>>>>>> - updatesToLeaders
>>>>>>> - shutdownLBHttpSolrServer
>>>>>>> - parallelUpdates
>>>>>>>
>>>>>>> Also, I remember in older versions of Solr there was an efficient
>>>>>>> format that was used between SolrJ and Solr that is more compact. Does
>>>>>>> this still exist in the latest version of Solr? If so, is it the default?
>>>>>>>
>>>>>>> Thanks