Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. The code below still results in 2 documents: one on 8983 and one on 7574.
String key = "1";

SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);
solrDoc.addField("content_mvtxt", "initial value");

SolrServer server = servers.get("http://localhost:8983/solr/collection1");

UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");

server = servers.get("http://localhost:7574/solr/collection1");

ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");
server.commit();

System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller <markrmil...@gmail.com> wrote:
> So I dunno. You are running a zk server and running in zk mode, right?
>
> You don't need to / shouldn't set a shards or self param. The shards are figured out from ZooKeeper.
>
> You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and turn on automatically in zk mode.
>
> If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas.
>
> You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs to (in which case, if it went down, you would have to choose another manually) - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway.
>
> - Mark
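For anyone who wants to try the CloudSolrServer route Mark mentions above, a minimal SolrJ sketch might look like the following. The ZooKeeper address (localhost:9983), the setDefaultCollection call, and the exact package/constructor details are assumptions based on later SolrJ releases, not something confirmed for the solrcloud branch.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudAddSketch {
        public static void main(String[] args) throws Exception {
            // Point SolrJ at ZooKeeper instead of a specific Solr node.
            // localhost:9983 is a guessed embedded-ZK address, not a value from this thread.
            CloudSolrServer server = new CloudSolrServer("localhost:9983");
            server.setDefaultCollection("collection1"); // assumed helper; may differ on the branch

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("key", "1");
            doc.addField("content_mvtxt", "initial value");

            UpdateRequest ureq = new UpdateRequest();
            ureq.setParam("update.chain", "distrib-update-chain");
            ureq.add(doc);
            ureq.setAction(ACTION.COMMIT, true, true);
            // The request goes to a live node discovered via ZooKeeper and is then
            // forwarded to the correct shard leader by the distrib update chain.
            server.request(ureq);
        }
    }

If the chosen node goes down, CloudSolrServer simply picks another live one, which is the advantage Mark describes over hard-coding the 8983 or 7574 URL.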
>
> On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> Really just trying to do a simple add and update test; the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this:
>>
>> String key = "1";
>>
>> SolrInputDocument solrDoc = new SolrInputDocument();
>> solrDoc.setField("key", key);
>> solrDoc.addField("content_mvtxt", "initial value");
>>
>> SolrServer server = servers.get("http://localhost:8983/solr/collection1");
>>
>> UpdateRequest ureq = new UpdateRequest();
>> ureq.setParam("update.chain", "distrib-update-chain");
>> ureq.add(solrDoc);
>> ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> ureq.setParam("self", "foo");
>> ureq.setAction(ACTION.COMMIT, true, true);
>> server.request(ureq);
>> server.commit();
>>
>> solrDoc = new SolrInputDocument();
>> solrDoc.addField("key", key);
>> solrDoc.addField("content_mvtxt", "updated value");
>>
>> server = servers.get("http://localhost:7574/solr/collection1");
>>
>> ureq = new UpdateRequest();
>> ureq.setParam("update.chain", "distrib-update-chain");
>> // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>> ureq.add(solrDoc);
>> ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> ureq.setParam("self", "foo");
>> ureq.setAction(ACTION.COMMIT, true, true);
>> server.request(ureq);
>> // server.add(solrDoc);
>> server.commit();
>>
>> server = servers.get("http://localhost:8983/solr/collection1");
>> server.commit();
>> System.out.println("done");
>>
>> but I'm still seeing the doc appear on both shards. After the first commit I see the doc on 8983 with "initial value". After the second commit I see the updated value on 7574 and the old value on 8983. After the final commit the doc on 8983 gets updated.
>>
>> Is there something wrong with my test?
>>
>> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> > Getting late - didn't really pay attention to your code, I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything set up right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test.
>> >
>> > Sent from my iPad
>> >
>> > On Dec 1, 2011, at 10:57 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >
>> >> Not sure offhand - but things will be funky if you don't specify the correct numShards.
>> >>
>> >> The instance-to-shard assignment should be using numShards to assign. But then the hash-to-shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal).
>> >>
>> >> So basically you are saying "I want 3 partitions," but you are only starting up 2 nodes, and the code is just not happy about that, I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.
>> >>
>> >> What are you trying to do? 2 partitions with no replicas, or one partition with one replica?
>> >>
>> >> In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting.
>> >>
>> >> This is all a work in progress, by the way - but what you are trying to test should work if things are set up right.
>> >>
>> >> - Mark
>> >>
>> >>
>> >> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
>> >>
>> >>> Thanks for the quick response.
>> >>> With that change (I have not set numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right:
>> >>>
>> >>> http://localhost:7574/solr/select/?q=*:*
>> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>> >>>
>> >>> http://localhost:8983/solr/select?q=*:*
>> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
>> >>>
>> >>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>> Hmm... sorry about that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it).
>> >>>>
>> >>>> Right now I explicitly commit on each server for tests.
>> >>>>
>> >>>> Can you try explicitly committing on server1 after updating the doc on server2?
>> >>>>
>> >>>> I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.
>> >>>>
>> >>>> Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.
>> >>>>
>> >>>> - Mark
>> >>>>
>> >>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
>> >>>>
>> >>>>> So I couldn't resist - I attempted to do this tonight. I used the solrconfig you mentioned (as is, no modifications), set up a 2-shard cluster in collection1, sent 1 doc to one of the shards, updated it, and sent the update to the other. I don't see the modifications, though; I only see the original document. The following is the test:
>> >>>>>
>> >>>>> public void update() throws Exception {
>> >>>>>
>> >>>>>     String key = "1";
>> >>>>>
>> >>>>>     SolrInputDocument solrDoc = new SolrInputDocument();
>> >>>>>     solrDoc.setField("key", key);
>> >>>>>     solrDoc.addField("content", "initial value");
>> >>>>>
>> >>>>>     SolrServer server = servers.get("http://localhost:8983/solr/collection1");
>> >>>>>     server.add(solrDoc);
>> >>>>>     server.commit();
>> >>>>>
>> >>>>>     solrDoc = new SolrInputDocument();
>> >>>>>     solrDoc.addField("key", key);
>> >>>>>     solrDoc.addField("content", "updated value");
>> >>>>>
>> >>>>>     server = servers.get("http://localhost:7574/solr/collection1");
>> >>>>>
>> >>>>>     UpdateRequest ureq = new UpdateRequest();
>> >>>>>     ureq.setParam("update.chain", "distrib-update-chain");
>> >>>>>     ureq.add(solrDoc);
>> >>>>>     ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>> >>>>>     ureq.setParam("self", "foo");
>> >>>>>     ureq.setAction(ACTION.COMMIT, true, true);
>> >>>>>     server.request(ureq);
>> >>>>>     System.out.println("done");
>> >>>>> }
>> >>>>>
>> >>>>> key is my unique field in schema.xml.
>> >>>>>
>> >>>>> What am I doing wrong?
>> >>>>>
>> >>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>> Yes, the ZK method seems much more flexible. Adding a new shard would simply be updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though.
>> >>>>>>
>> >>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>> Right now, let's say you have one shard - everything there hashes to range X.
>> >>>>>>>
>> >>>>>>> Now you want to split that shard with an Index Splitter.
>> >>>>>>>
>> >>>>>>> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1, if below, index2. We need to instead do a quick hash rather than a simple id compare.
>> >>>>>>>
>> >>>>>>> Why do you need to do this on every shard?
>> >>>>>>>
>> >>>>>>> The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course).
>> >>>>>>>
>> >>>>>>> At the start, zk would say: for range X, go to this shard. After the split, it would say: for range less than X/2 go to the old node, for range greater than X/2 go to the new node.
>> >>>>>>>
>> >>>>>>> - Mark
>> >>>>>>>
>> >>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>> >>>>>>>
>> >>>>>>>> Hmm... this doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.
>> >>>>>>>>
>> >>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>> >>>>>>>>>
>> >>>>>>>>>> I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.
>> >>>>>>>>>
>> >>>>>>>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>> >>>>>>>>>
>> >>>>>>>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>> >>>>>>>>>
>> >>>>>>>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-half ranges.
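To make the rehash-and-split decision described above concrete, here is a rough sketch. The hash function (String.hashCode) and the integer range bounds are placeholders, not the hashing actually used on the solrcloud branch.

    // Illustrative only: when splitting a shard whose hash range is [start, end],
    // rehash each doc id and route it to one of the two new indexes.
    public class SplitSketch {
        // Stand-in for whatever hash of the doc id the real splitter would use.
        static int hash(String docId) {
            return docId.hashCode();
        }

        // true  -> doc goes to the index covering the upper half of the range
        // false -> doc goes to the index covering the lower half
        static boolean upperHalf(String docId, long start, long end) {
            long mid = (start + end) / 2; // split the range at its midpoint
            return hash(docId) > mid;
        }

        public static void main(String[] args) {
            // Example: a shard that covered the whole int range, split in two.
            System.out.println(upperHalf("1", Integer.MIN_VALUE, Integer.MAX_VALUE));
        }
    }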
>> >>>>>>>>>
>> >>>>>>>>>> Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?
>> >>>>>>>>>
>> >>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>> >>>>>>>>>
>> >>>>>>>>> - Mark
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>>>>>> Not yet - at this point we don't plan on working on this until a lot of other stuff is working solidly. But someone else could jump in!
>> >>>>>>>>>>>
>> >>>>>>>>>>> There are a couple of ways to go about it that I know of:
>> >>>>>>>>>>>
>> >>>>>>>>>>> A more long-term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with.
>> >>>>>>>>>>>
>> >>>>>>>>>>> A simpler and likely first step is to use an index splitter. We already have one in Lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself...
>> >>>>>>>>>>>
>> >>>>>>>>>>> Right now though, the idea is that you should pick a good number of partitions to start, given your expected data ;) Adding more replicas is trivial though.
>> >>>>>>>>>>>
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Another question: is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>>> Thanks, I will try this first thing in the morning.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> - Mark
>> >>>>>>>>>>>
>> >>>>>>>>>>> http://www.lucidimagination.com
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> - Mark Miller
>> >>>>>>>>> lucidimagination.com
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> - Mark Miller
>> >>>>>>> lucidimagination.com
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>> - Mark Miller
>> >>>> lucidimagination.com
>> >>>>
>> >>
>> >> - Mark Miller
>> >> lucidimagination.com
>> >>
>> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>