It's not full of details yet, but there is a JIRA issue here: https://issues.apache.org/jira/browse/SOLR-2595
On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> Yes, the ZK method seems much more flexible. Adding a new shard would
> simply be a matter of updating the range assignments in ZK. Where is this
> currently on the list of things to accomplish? I don't have time to
> work on this now, but if you (or anyone) could provide direction I'd
> be willing to work on this when I have spare time. I guess a JIRA
> detailing where/how to do this could help. Not sure if the design has
> been thought out that far though.
>
> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
> > Right now let's say you have one shard - everything there hashes to range X.
> >
> > Now you want to split that shard with an Index Splitter.
> >
> > You divide range X in two - giving you two ranges - then you start
> > splitting. This is where the current Splitter needs a little modification.
> > You decide which doc should go into which new index by rehashing each doc
> > id in the index you are splitting - if its hash is greater than X/2, it
> > goes into index1 - if it's less, index2. I think there are a couple of
> > current Splitter impls, but one of them does something like: give me an
> > id - now if the ids in the index are above that id, go to index1, if
> > below, index2. We need to instead do a quick hash rather than a simple
> > id compare.
> >
> > Why do you need to do this on every shard?
> >
> > The other part we need that we don't have is to store hash range
> > assignments in ZooKeeper - we don't do that yet because it's not needed
> > yet. Instead we currently just calculate that on the fly (too often at
> > the moment - on every request :) I intend to fix that of course).
> >
> > At the start, ZK would say: for range X, go to this shard. After the
> > split, it would say: for ranges less than X/2 go to the old node, for
> > ranges greater than X/2 go to the new node.
> >
> > - Mark
> >
> > On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >
> >> Hmmm... this doesn't sound like the hashing algorithm that's on the
> >> branch, right? The algorithm you're mentioning sounds like there is
> >> some logic which is able to tell that a particular range should be
> >> distributed between 2 shards instead of 1. So it seems like a trade-off
> >> between repartitioning the entire index (on every shard) and having a
> >> custom hashing algorithm which is able to handle the situation where 2
> >> or more shards map to a particular range.
> >>
> >> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>
> >>>> I am not familiar with the index splitter that is in contrib, but I'll
> >>>> take a look at it soon. So the process sounds like it would be to run
> >>>> this on all of the current shards' indexes based on the hash algorithm.
> >>>
> >>> Not something I've thought deeply about myself yet, but I think the
> >>> idea would be to split as many as you felt you needed to.
> >>>
> >>> If you wanted to keep the full balance always, this would mean
> >>> splitting every shard at once, yes. But this depends on how many boxes
> >>> (partitions) you are willing/able to add at a time.
> >>>
> >>> You might just split one index to start - now its hash range would be
> >>> handled by two shards instead of one (if you have 3 replicas per shard,
> >>> this would mean adding 3 more boxes). When you needed to expand again,
> >>> you would split another index that was still handling its full starting
> >>> range. As you grow, once you split every original index, you'd start
> >>> again, splitting one of the now half ranges.
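A minimal sketch of the "rehash each doc id" split decision Mark describes
above. None of this is code from the SolrCloud branch - the class names, the
Range helper, and the use of String.hashCode() as the hash are stand-ins for
illustration only:

    public class HashSplitSketch {

        /** A contiguous hash range [min, max] owned by one shard. */
        static class Range {
            final int min, max;
            Range(int min, int max) { this.min = min; this.max = max; }
            boolean contains(int hash) { return hash >= min && hash <= max; }
            /** Divide range X in two at its midpoint (the "X/2" above). */
            Range[] split() {
                // compute the midpoint in long to avoid int overflow
                int mid = (int) (((long) min + (long) max) >> 1);
                return new Range[] { new Range(min, mid), new Range(mid + 1, max) };
            }
        }

        // Stand-in for whatever hash the branch uses to map ids onto ranges.
        static int hash(String docId) { return docId.hashCode(); }

        /** 0 = doc goes to the lower-half index, 1 = the upper-half index. */
        static int targetIndex(String docId, Range lowerHalf) {
            return lowerHalf.contains(hash(docId)) ? 0 : 1;
        }

        public static void main(String[] args) {
            Range whole = new Range(Integer.MIN_VALUE, Integer.MAX_VALUE);
            Range[] halves = whole.split();
            for (String id : new String[] { "doc-1", "doc-2", "doc-3" }) {
                System.out.println(id + " -> index" + targetIndex(id, halves[0]));
            }
        }
    }

A modified splitter would make this per-document decision while copying the
original index, instead of comparing raw ids the way the current contrib
splitter does, and the same range boundaries would be what gets recorded in ZK.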
> >>>
> >>>> Is there also an index merger in contrib which could be used to merge
> >>>> indexes? I'm assuming this would be the process?
> >>>
> >>> You can merge with IndexWriter.addIndexes (Solr also has an admin
> >>> command that can do this). But I'm not sure where this fits in?
> >>>
> >>> - Mark
> >>>
> >>>>
> >>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>> Not yet - we don't plan on working on this until a lot of other
> >>>>> stuff is working solidly. But someone else could jump in!
> >>>>>
> >>>>> There are a couple of ways to go about it that I know of:
> >>>>>
> >>>>> A longer-term solution may be to start using micro shards - each
> >>>>> index starts as multiple indexes. This makes it pretty fast to move
> >>>>> micro shards around as you decide to change partitions. It's also
> >>>>> less flexible, as you are limited by the number of micro shards you
> >>>>> start with.
> >>>>>
> >>>>> A simpler and more likely first step is to use an index splitter. We
> >>>>> already have one in Lucene contrib - we would just need to modify it
> >>>>> so that it splits based on the hash of the document id. This is super
> >>>>> flexible, but splitting will obviously take a little while on a huge
> >>>>> index. The current index splitter is a multi-pass splitter - good
> >>>>> enough to start with, but with most files under codec control these
> >>>>> days, we may be able to make a single-pass splitter soon as well.
> >>>>>
> >>>>> Eventually you could imagine using both options - micro shards that
> >>>>> could also be split as needed. Though I still wonder if micro shards
> >>>>> will be worth the extra complications myself...
> >>>>>
> >>>>> Right now though, the idea is that you should pick a good number of
> >>>>> partitions to start with given your expected data ;) Adding more
> >>>>> replicas is trivial though.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> Another question: is there any support for repartitioning of the
> >>>>>> index if a new shard is added? What is the recommended approach for
> >>>>>> handling this? It seemed that the hashing algorithm (and probably
> >>>>>> any) would require the index to be repartitioned should a new shard
> >>>>>> be added.
> >>>>>>
> >>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>> Thanks, I will try this first thing in the morning.
> >>>>>>>
> >>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at the latest solrcloud branch and was
> >>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>> DistributedUpdateProcessor? What specifically in solrconfig.xml
> >>>>>>>>> needs to be added/modified to make distributed indexing work?
> >>>>>>>>
> >>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
> >>>>>>>> solr/core/src/test-files
> >>>>>>>>
> >>>>>>>> You need to enable the update log, add an empty replication
> >>>>>>>> handler def, and an update chain with
> >>>>>>>> solr.DistributedUpdateProcessFactory in it.
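For reference, a rough sketch of what those three pieces might look like in
solrconfig.xml. This is not copied from solrconfig-distrib-update.xml - the
chain name, the Log/Run processors, and the updateLog parameters are
assumptions, so check the actual test file on the branch for the exact
contents:

    <!-- 1. Enable the update (transaction) log. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. An empty replication handler definition. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3. An update chain that includes the distributed update processor
            (factory class name as given above). -->
    <updateRequestProcessorChain name="distrib-update-chain" default="true">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>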
--
- Mark

http://www.lucidimagination.com