So I'm a fool. I did set numShards; the issue was so trivial it's embarrassing. I did indeed have it set up as a replica: the shard names in solr.xml were both shard1. With that fixed, this now works as I expected.
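For anyone tripping over the same thing, here is a minimal sketch of the solr.xml difference (instanceDir values are hypothetical, and the exact attribute names should be checked against the branch). With numShards passed as a startup system property and both cores explicitly naming the same shard, the two nodes register as leader and replica of one shard; distinct shard names (or leaving the shard attribute off and letting numShards drive the assignment) give two partitions instead:

    <!-- Sketch only: one core entry per Solr instance's solr.xml. -->

    <!-- Instance on :8983 -->
    <core name="collection1" instanceDir="." shard="shard1" collection="collection1"/>

    <!-- Instance on :7574 -- same shard name = replica of shard1.
         Use shard="shard2" here (or rely on numShards=2) to get a second partition. -->
    <core name="collection1" instanceDir="." shard="shard1" collection="collection1"/>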
On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller <markrmil...@gmail.com> wrote:

They are unused params, so removing them wouldn't help anything.

You might just want to wait till we are further along before playing with it.

Or if you submit your full self-contained test, I can see what's going on (e.g. it's still unclear if you have started setting numShards?).

I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica.

- Mark

On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:

Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents: 1 on 8983 and 1 on 7574.

    String key = "1";

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);
    solrDoc.addField("content_mvtxt", "initial value");

    SolrServer server = servers.get("http://localhost:8983/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content_mvtxt", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    server.commit();

    server = servers.get("http://localhost:8983/solr/collection1");
    server.commit();
    System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <markrmil...@gmail.com> wrote:

So I dunno. You are running a zk server and running in zk mode, right?

You don't need to / shouldn't set a shards or self param. The shards are figured out from ZooKeeper.

You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and turn on automatically in zk mode.

If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas.

You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs to (in which case, if it went down, you would have to choose another manually) - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway.

- Mark
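To illustrate the CloudSolrServer route Mark describes, here is a rough SolrJ sketch; the ZooKeeper address and collection name are made up, and the exact client API on the branch may differ slightly:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudClientSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper address; use whatever the cluster was started with.
            CloudSolrServer cloud = new CloudSolrServer("localhost:9983");
            cloud.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("key", "1");
            doc.addField("content_mvtxt", "initial value");

            // No need to pick 8983 or 7574 by hand: the client finds a live node
            // through ZooKeeper, and the distributed update chain forwards the doc
            // to the correct shard leader, which then forwards to replicas.
            cloud.add(doc);
            cloud.commit();
        }
    }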
On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Really just trying to do a simple add and update test; the missing chain is just proof of my not understanding exactly how this is supposed to work. I modified the code to this:

    String key = "1";

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);
    solrDoc.addField("content_mvtxt", "initial value");

    SolrServer server = servers.get("http://localhost:8983/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content_mvtxt", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
    ureq.add(solrDoc);
    ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    // server.add(solrDoc);
    server.commit();

    server = servers.get("http://localhost:8983/solr/collection1");
    server.commit();
    System.out.println("done");

but I'm still seeing the doc appear on both shards. After the first commit I see the doc on 8983 with "initial value". After the second commit I see the updated value on 7574 and the old one on 8983. After the final commit the doc on 8983 gets updated.

Is there something wrong with my test?

On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <markrmil...@gmail.com> wrote:

Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything set up right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test.

Sent from my iPad

On Dec 1, 2011, at 10:57 PM, Mark Miller <markrmil...@gmail.com> wrote:

Not sure offhand - but things will be funky if you don't specify the correct numShards.

The instance-to-shard assignment should be using numShards to assign. But then the hash-to-shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal).

So basically you are saying "I want 3 partitions", but you are only starting up 2 nodes, and the code is just not happy about that, I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.

What are you trying to do? 2 partitions with no replicas, or one partition with one replica?

In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting.

This is all a work in progress by the way - what you are trying to test should work if things are set up right though.

- Mark
On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:

Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right:

    http://localhost:7574/solr/select/?q=*:*
    <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

    http://localhost:8983/solr/select?q=*:*
    <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <markrmil...@gmail.com> wrote:

Hmm... sorry about that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).

Right now I explicitly commit on each server for tests.

Can you try explicitly committing on server1 after updating the doc on server2?

I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.

Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.

- Mark

On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

So I couldn't resist, I attempted to do this tonight. I used the solrconfig you mentioned (as is, no modifications), I set up a 2-shard cluster in collection1, I sent 1 doc to 1 of the shards, updated it and sent the update to the other. I don't see the modifications though; I only see the original document. The following is the test:

    public void update() throws Exception {
        String key = "1";

        SolrInputDocument solrDoc = new SolrInputDocument();
        solrDoc.setField("key", key);
        solrDoc.addField("content", "initial value");

        SolrServer server = servers.get("http://localhost:8983/solr/collection1");
        server.add(solrDoc);
        server.commit();

        solrDoc = new SolrInputDocument();
        solrDoc.addField("key", key);
        solrDoc.addField("content", "updated value");

        server = servers.get("http://localhost:7574/solr/collection1");

        UpdateRequest ureq = new UpdateRequest();
        ureq.setParam("update.chain", "distrib-update-chain");
        ureq.add(solrDoc);
        ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
        ureq.setParam("self", "foo");
        ureq.setAction(ACTION.COMMIT, true, true);
        server.request(ureq);
        System.out.println("done");
    }

key is my unique field in schema.xml.

What am I doing wrong?

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Yes, the ZK method seems much more flexible. Adding a new shard would simply be a matter of updating the range assignments in ZK. Where is this currently on the list of things to accomplish?
I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though.

On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:

Right now let's say you have one shard - everything there hashes to range X.

Now you want to split that shard with an index splitter.

You divide range X in two - giving you two ranges - then you start splitting. This is where the current splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a couple of current splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1; if below, index2. We need to instead do a quick hash rather than a simple id compare.

Why do you need to do this on every shard?

The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).

At the start, zk would say: for range X, go to this shard. After the split, it would say: for range less than X/2 go to the old node, for range greater than X/2 go to the new node.

- Mark
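A rough Java sketch of the split decision Mark describes - purely illustrative, not the actual contrib splitter; the hash function and range bounds here are assumptions:

    // Illustrative sketch only: the per-document decision a hash-based splitter
    // would make. The hash function and range endpoints are stand-ins, not the
    // actual SolrCloud hashing code.
    public final class HashSplitSketch {

        // Stand-in for whatever hash SolrCloud applies to the unique key.
        static int hash(String id) {
            return id.hashCode();
        }

        /**
         * Given the shard's current hash range [rangeStart, rangeEnd], return 1 if the
         * doc belongs in the new upper-half index, 0 if it stays in the lower-half index.
         */
        static int targetIndex(String id, int rangeStart, int rangeEnd) {
            int mid = rangeStart + (rangeEnd - rangeStart) / 2; // the "X/2" in the description
            return hash(id) > mid ? 1 : 0;
        }
    }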
On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

Hmmm... this doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.

On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:

On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

> I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.

Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.

If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.

You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-half ranges.

> Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?

You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?

- Mark
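For reference, merging with IndexWriter.addIndexes looks roughly like this; the directory paths are hypothetical and the Version constant depends on the Lucene version on the branch:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths: merge two source indexes into a target index.
            FSDirectory target = FSDirectory.open(new File("/path/to/merged"));
            FSDirectory src1 = FSDirectory.open(new File("/path/to/index1"));
            FSDirectory src2 = FSDirectory.open(new File("/path/to/index2"));

            IndexWriterConfig cfg =
                new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
            IndexWriter writer = new IndexWriter(target, cfg);
            writer.addIndexes(src1, src2); // copies the source segments into the target index
            writer.close();                // commits the merge
        }
    }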
On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:

Not yet - we don't plan on working on this until a lot of other stuff is working solid at this point. But someone else could jump in!

There are a couple of ways to go about it that I know of:

A more long-term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with.

A simpler and likely first step is to use an index splitter. We already have one in lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well.

Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself...

Right now though, the idea is that you should pick a good number of partitions to start, given your expected data ;) Adding more replicas is trivial though.

- Mark

On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Another question: is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added.

On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Thanks, I will try this first thing in the morning.

On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:

On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:

> I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work?

Hi Jamie - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files.

You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it.

- Mark

http://www.lucidimagination.com
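Putting those three pieces together, a solrconfig.xml sketch along the lines Mark describes might look like the following; treat the exact element and factory class names as assumptions and take the authoritative ones from solrconfig-distrib-update.xml on the branch:

    <!-- Sketch only: check solrconfig-distrib-update.xml in solr/core/src/test-files
         for the exact names used on the solrcloud branch. -->

    <!-- 1. Enable the update log. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.data.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. An empty replication handler definition. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />

    <!-- 3. An update chain containing the distributed update processor,
            referenced from requests as update.chain=distrib-update-chain. -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>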