It's not full of details yet, but there is a JIRA issue here: https://issues.apache.org/jira/browse/SOLR-2595
On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> Yes, the ZK method seems much more flexible. Adding a new shard would
> simply be a matter of updating the range assignments in ZK. Where is this
> currently on the list of things to accomplish? I don't have time to
> work on this now, but if you (or anyone) could provide direction I'd
> be willing to work on this when I have spare time. I guess a JIRA
> detailing where/how to do this could help. Not sure if the design has
> been thought out that far though.
>
> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
> > Right now let's say you have one shard - everything there hashes to range X.
> >
> > Now you want to split that shard with an Index Splitter.
> >
> > You divide range X in two - giving you two ranges - then you start
> > splitting. This is where the current Splitter needs a little modification.
> > You decide which doc should go into which new index by rehashing each doc
> > id in the index you are splitting - if its hash is greater than X/2, it
> > goes into index1 - if it's less, index2. I think there are a couple of
> > current Splitter impls, but one of them does something like: give me an
> > id - now if the ids in the index are above that id, go to index1, if
> > below, index2. We need to instead do a quick hash rather than a simple
> > id compare.
> >
> > Why do you need to do this on every shard?
> >
> > The other part we need that we don't have is to store hash range
> > assignments in ZooKeeper - we don't do that yet because it's not needed
> > yet. Instead we currently just calculate that on the fly (too often at
> > the moment - on every request :) I intend to fix that of course).
> >
> > At the start, ZK would say: for range X, go to this shard. After the
> > split, it would say: for ranges less than X/2 go to the old node, for
> > ranges greater than X/2 go to the new node.
> >
> > - Mark
> >
> > On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >
> >> Hmmm... this doesn't sound like the hashing algorithm that's on the
> >> branch, right? The algorithm you're mentioning sounds like there is
> >> some logic which is able to tell that a particular range should be
> >> distributed between 2 shards instead of 1. So it seems like a trade-off
> >> between repartitioning the entire index (on every shard) and having a
> >> custom hashing algorithm which is able to handle the situation where 2
> >> or more shards map to a particular range.
> >>
> >> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>
> >>>> I am not familiar with the index splitter that is in contrib, but I'll
> >>>> take a look at it soon. So the process sounds like it would be to run
> >>>> this on all of the current shards' indexes based on the hash algorithm.
> >>>
> >>> Not something I've thought deeply about myself yet, but I think the
> >>> idea would be to split as many as you felt you needed to.
> >>>
> >>> If you wanted to keep the full balance always, this would mean
> >>> splitting every shard at once, yes. But this depends on how many boxes
> >>> (partitions) you are willing/able to add at a time.
> >>>
> >>> You might just split one index to start - now its hash range would be
> >>> handled by two shards instead of one (if you have 3 replicas per shard,
> >>> this would mean adding 3 more boxes). When you needed to expand again,
> >>> you would split another index that was still handling its full starting
> >>> range. As you grow, once you split every original index, you'd start
> >>> again, splitting one of the now half ranges.
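A minimal sketch of the "rehash each doc id" split decision Mark describes
above. None of this is code from the SolrCloud branch - the class names, the
Range helper, and the use of String.hashCode() as the hash are stand-ins for
illustration only:

    public class HashSplitSketch {

        /** A contiguous hash range [min, max] owned by one shard. */
        static class Range {
            final int min, max;
            Range(int min, int max) { this.min = min; this.max = max; }
            boolean contains(int hash) { return hash >= min && hash <= max; }
            /** Divide range X in two at its midpoint (the "X/2" above). */
            Range[] split() {
                // compute the midpoint in long to avoid int overflow
                int mid = (int) (((long) min + (long) max) >> 1);
                return new Range[] { new Range(min, mid), new Range(mid + 1, max) };
            }
        }

        // Stand-in for whatever hash the branch uses to map ids onto ranges.
        static int hash(String docId) { return docId.hashCode(); }

        /** 0 = doc goes to the lower-half index, 1 = the upper-half index. */
        static int targetIndex(String docId, Range lowerHalf) {
            return lowerHalf.contains(hash(docId)) ? 0 : 1;
        }

        public static void main(String[] args) {
            Range whole = new Range(Integer.MIN_VALUE, Integer.MAX_VALUE);
            Range[] halves = whole.split();
            for (String id : new String[] { "doc-1", "doc-2", "doc-3" }) {
                System.out.println(id + " -> index" + targetIndex(id, halves[0]));
            }
        }
    }

A modified splitter would make this per-document decision while copying the
original index, instead of comparing raw ids the way the current contrib
splitter does, and the same range boundaries would be what gets recorded in ZK.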
> >>>
> >>>> Is there also an index merger in contrib which could be used to merge
> >>>> indexes? I'm assuming this would be the process?
> >>>
> >>> You can merge with IndexWriter.addIndexes (Solr also has an admin
> >>> command that can do this). But I'm not sure where this fits in?
> >>>
> >>> - Mark
> >>>
> >>>>
> >>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>> Not yet - we don't plan on working on this until a lot of other
> >>>>> stuff is working solidly. But someone else could jump in!
> >>>>>
> >>>>> There are a couple of ways to go about it that I know of:
> >>>>>
> >>>>> A longer-term solution may be to start using micro shards - each
> >>>>> index starts as multiple indexes. This makes it pretty fast to move
> >>>>> micro shards around as you decide to change partitions. It's also
> >>>>> less flexible, as you are limited by the number of micro shards you
> >>>>> start with.
> >>>>>
> >>>>> A simpler and more likely first step is to use an index splitter. We
> >>>>> already have one in Lucene contrib - we would just need to modify it
> >>>>> so that it splits based on the hash of the document id. This is super
> >>>>> flexible, but splitting will obviously take a little while on a huge
> >>>>> index. The current index splitter is a multi-pass splitter - good
> >>>>> enough to start with, but with most files under codec control these
> >>>>> days, we may be able to make a single-pass splitter soon as well.
> >>>>>
> >>>>> Eventually you could imagine using both options - micro shards that
> >>>>> could also be split as needed. Though I still wonder if micro shards
> >>>>> will be worth the extra complications myself...
> >>>>>
> >>>>> Right now though, the idea is that you should pick a good number of
> >>>>> partitions to start with given your expected data ;) Adding more
> >>>>> replicas is trivial though.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> Another question: is there any support for repartitioning of the
> >>>>>> index if a new shard is added? What is the recommended approach for
> >>>>>> handling this? It seemed that the hashing algorithm (and probably
> >>>>>> any) would require the index to be repartitioned should a new shard
> >>>>>> be added.
> >>>>>>
> >>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>> Thanks, I will try this first thing in the morning.
> >>>>>>>
> >>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at the latest solrcloud branch and was
> >>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>> DistributedUpdateProcessor? What specifically in solrconfig.xml
> >>>>>>>>> needs to be added/modified to make distributed indexing work?
> >>>>>>>>
> >>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
> >>>>>>>> solr/core/src/test-files
> >>>>>>>>
> >>>>>>>> You need to enable the update log, add an empty replication
> >>>>>>>> handler def, and an update chain with
> >>>>>>>> solr.DistributedUpdateProcessFactory in it.
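For reference, a rough sketch of what those three pieces might look like in
solrconfig.xml. This is not copied from solrconfig-distrib-update.xml - the
chain name, the Log/Run processors, and the updateLog parameters are
assumptions, so check the actual test file on the branch for the exact
contents:

    <!-- 1. Enable the update (transaction) log. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. An empty replication handler definition. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3. An update chain that includes the distributed update processor
            (factory class name as given above). -->
    <updateRequestProcessorChain name="distrib-update-chain" default="true">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>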
--
- Mark

http://www.lucidimagination.com