Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. The code below still results in 2 documents: one on 8983 and one on 7574.
String key = "1";

SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);
solrDoc.addField("content_mvtxt", "initial value");

SolrServer server = servers.get("http://localhost:8983/solr/collection1");

UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");

server = servers.get("http://localhost:7574/solr/collection1");

ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");
server.commit();

System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <markrmil...@gmail.com> wrote:
> So I dunno. You are running a zk server and running in zk mode, right?
>
> You don't need to / shouldn't set a shards or self param. The shards are figured out from ZooKeeper.
>
> You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and turn on automatically in zk mode.
>
> If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas.
>
> You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs to (in which case, if it went down, you would have to choose another manually) - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway.
>
> - Mark
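For anyone who wants to try the CloudSolrServer route Mark mentions above, a minimal SolrJ sketch might look like the following. The ZooKeeper address (localhost:9983), the setDefaultCollection call, and the exact package/constructor details are assumptions based on later SolrJ releases, not something confirmed for the solrcloud branch.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudAddSketch {
        public static void main(String[] args) throws Exception {
            // Point SolrJ at ZooKeeper instead of a specific Solr node.
            // localhost:9983 is a guessed embedded-ZK address, not a value from this thread.
            CloudSolrServer server = new CloudSolrServer("localhost:9983");
            server.setDefaultCollection("collection1"); // assumed helper; may differ on the branch

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("key", "1");
            doc.addField("content_mvtxt", "initial value");

            UpdateRequest ureq = new UpdateRequest();
            ureq.setParam("update.chain", "distrib-update-chain");
            ureq.add(doc);
            ureq.setAction(ACTION.COMMIT, true, true);
            // The request goes to a live node discovered via ZooKeeper and is then
            // forwarded to the correct shard leader by the distrib update chain.
            server.request(ureq);
        }
    }

If the chosen node goes down, CloudSolrServer simply picks another live one, which is the advantage Mark describes over hard-coding the 8983 or 7574 URL.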
>
> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> Really just trying to do a simple add and update test; the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this:
>>
>> String key = "1";
>>
>> SolrInputDocument solrDoc = new SolrInputDocument();
>> solrDoc.setField("key", key);
>> solrDoc.addField("content_mvtxt", "initial value");
>>
>> SolrServer server = servers.get("http://localhost:8983/solr/collection1");
>>
>> UpdateRequest ureq = new UpdateRequest();
>> ureq.setParam("update.chain", "distrib-update-chain");
>> ureq.add(solrDoc);
>> ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> ureq.setParam("self", "foo");
>> ureq.setAction(ACTION.COMMIT, true, true);
>> server.request(ureq);
>> server.commit();
>>
>> solrDoc = new SolrInputDocument();
>> solrDoc.addField("key", key);
>> solrDoc.addField("content_mvtxt", "updated value");
>>
>> server = servers.get("http://localhost:7574/solr/collection1");
>>
>> ureq = new UpdateRequest();
>> ureq.setParam("update.chain", "distrib-update-chain");
>> // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>> ureq.add(solrDoc);
>> ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> ureq.setParam("self", "foo");
>> ureq.setAction(ACTION.COMMIT, true, true);
>> server.request(ureq);
>> // server.add(solrDoc);
>> server.commit();
>>
>> server = servers.get("http://localhost:8983/solr/collection1");
>> server.commit();
>> System.out.println("done");
>>
>> but I'm still seeing the doc appear on both shards. After the first commit I see the doc on 8983 with "initial value". After the second commit I see the updated value on 7574 and the old value on 8983. After the final commit the doc on 8983 gets updated.
>>
>> Is there something wrong with my test?
>>
>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> > Getting late - didn't really pay attention to your code, I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything set up right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test.
>> >
>> > Sent from my iPad
>> >
>> > On Dec 1, 2011, at 10:57 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >
>> >> Not sure offhand - but things will be funky if you don't specify the correct numShards.
>> >>
>> >> The instance-to-shard assignment should be using numShards to assign. But then the hash-to-shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal).
>> >>
>> >> So basically you are saying "I want 3 partitions," but you are only starting up 2 nodes, and the code is just not happy about that, I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.
>> >>
>> >> What are you trying to do? 2 partitions with no replicas, or one partition with one replica?
>> >>
>> >> In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting.
>> >>
>> >> This is all a work in progress, by the way - but what you are trying to test should work if things are set up right.
>> >>
>> >> - Mark
>> >>
>> >>
>> >> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>> >>
>> >>> Thanks for the quick response.
>> >>> With that change (I have not set numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right:
>> >>>
>> >>> http://localhost:7574/solr/select/?q=*:*
>> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>> >>>
>> >>> http://localhost:8983/solr/select?q=*:*
>> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>> >>>
>> >>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>> Hmm... sorry about that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).
>> >>>>
>> >>>> Right now I explicitly commit on each server for tests.
>> >>>>
>> >>>> Can you try explicitly committing on server1 after updating the doc on server2?
>> >>>>
>> >>>> I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.
>> >>>>
>> >>>> Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.
>> >>>>
>> >>>> - Mark
>> >>>>
>> >>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>> >>>>
>> >>>>> So I couldn't resist - I attempted to do this tonight. I used the solrconfig you mentioned (as is, no modifications), set up a 2-shard cluster in collection1, sent 1 doc to one of the shards, updated it, and sent the update to the other. I don't see the modifications, though; I only see the original document. The following is the test:
>> >>>>>
>> >>>>> public void update() throws Exception {
>> >>>>>
>> >>>>>     String key = "1";
>> >>>>>
>> >>>>>     SolrInputDocument solrDoc = new SolrInputDocument();
>> >>>>>     solrDoc.setField("key", key);
>> >>>>>     solrDoc.addField("content", "initial value");
>> >>>>>
>> >>>>>     SolrServer server = servers.get("http://localhost:8983/solr/collection1");
>> >>>>>     server.add(solrDoc);
>> >>>>>     server.commit();
>> >>>>>
>> >>>>>     solrDoc = new SolrInputDocument();
>> >>>>>     solrDoc.addField("key", key);
>> >>>>>     solrDoc.addField("content", "updated value");
>> >>>>>
>> >>>>>     server = servers.get("http://localhost:7574/solr/collection1");
>> >>>>>
>> >>>>>     UpdateRequest ureq = new UpdateRequest();
>> >>>>>     ureq.setParam("update.chain", "distrib-update-chain");
>> >>>>>     ureq.add(solrDoc);
>> >>>>>     ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>>>     ureq.setParam("self", "foo");
>> >>>>>     ureq.setAction(ACTION.COMMIT, true, true);
>> >>>>>     server.request(ureq);
>> >>>>>     System.out.println("done");
>> >>>>> }
>> >>>>>
>> >>>>> key is my unique field in schema.xml.
>> >>>>>
>> >>>>> What am I doing wrong?
>> >>>>>
>> >>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>> Yes, the ZK method seems much more flexible. Adding a new shard would simply be updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though.
>> >>>>>>
>> >>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>> Right now, let's say you have one shard - everything there hashes to range X.
>> >>>>>>>
>> >>>>>>> Now you want to split that shard with an Index Splitter.
>> >>>>>>>
>> >>>>>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1, if below, index2. We need to instead do a quick hash rather than a simple id compare.
>> >>>>>>>
>> >>>>>>> Why do you need to do this on every shard?
>> >>>>>>>
>> >>>>>>> The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>> >>>>>>>
>> >>>>>>> At the start, zk would say: for range X, go to this shard. After the split, it would say: for range less than X/2 go to the old node, for range greater than X/2 go to the new node.
>> >>>>>>>
>> >>>>>>> - Mark
>> >>>>>>>
>> >>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>> >>>>>>>
>> >>>>>>>> Hmm... this doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.
>> >>>>>>>>
>> >>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>> >>>>>>>>>
>> >>>>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.
>> >>>>>>>>>
>> >>>>>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>> >>>>>>>>>
>> >>>>>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>> >>>>>>>>>
>> >>>>>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-half ranges.
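To make the rehash-and-split decision described above concrete, here is a rough sketch. The hash function (String.hashCode) and the integer range bounds are placeholders, not the hashing actually used on the solrcloud branch.

    // Illustrative only: when splitting a shard whose hash range is [start, end],
    // rehash each doc id and route it to one of the two new indexes.
    public class SplitSketch {
        // Stand-in for whatever hash of the doc id the real splitter would use.
        static int hash(String docId) {
            return docId.hashCode();
        }

        // true  -> doc goes to the index covering the upper half of the range
        // false -> doc goes to the index covering the lower half
        static boolean upperHalf(String docId, long start, long end) {
            long mid = (start + end) / 2; // split the range at its midpoint
            return hash(docId) > mid;
        }

        public static void main(String[] args) {
            // Example: a shard that covered the whole int range, split in two.
            System.out.println(upperHalf("1", Integer.MIN_VALUE, Integer.MAX_VALUE));
        }
    }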
>> >>>>>>>>>
>> >>>>>>>>>> Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?
>> >>>>>>>>>
>> >>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>> >>>>>>>>>
>> >>>>>>>>> - Mark
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>>>>>> Not yet - at this point we don't plan on working on this until a lot of other stuff is working solidly. But someone else could jump in!
>> >>>>>>>>>>>
>> >>>>>>>>>>> There are a couple of ways to go about it that I know of:
>> >>>>>>>>>>>
>> >>>>>>>>>>> A more long-term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with.
>> >>>>>>>>>>>
>> >>>>>>>>>>> A simpler and likely first step is to use an index splitter. We already have one in Lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself...
>> >>>>>>>>>>>
>> >>>>>>>>>>> Right now though, the idea is that you should pick a good number of partitions to start, given your expected data ;) Adding more replicas is trivial though.
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Another question: is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>>> Thanks, I will try this first thing in the morning.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> - Mark Miller
>> >>>>>>>>> lucidimagination.com
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> - Mark Miller
>> >>>>>>> lucidimagination.com
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>> - Mark Miller
>> >>>> lucidimagination.com
>> >>>>
>> >>
>> >> - Mark Miller
>> >> lucidimagination.com
>> >>
>> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>