Thanks Jeff, I understand your philosophy and it sounds correct. Since we had many problems with zookeeper when switching to Solr Cloud, we couldn't treat it as the source of truth and had to rely on a more stable source. The issue is that when we got such a zookeeper event, it brought our system down, and in those cases, clearing the core.properties was a life saver. We've managed to make it pretty stable now, but we will always need a "doomsday" weapon.
I looked into the related JIRA and it confused me a little, and it raised a few other questions:
1. What exactly defines zookeeper as the source of truth?
2. What is the role of core.properties if the state is only in zookeeper?

Your tool is very interesting; I had just been thinking about writing such a tool myself. From the sources I understand that you represent each node as a path in the git repository. So I guess that for restore purposes I will have to go in the opposite direction and create a node for every path entry (a rough sketch of what I mean is at the bottom of this mail).

On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> I've been running SolrCloud clusters in various versions for a few years
> here, and I can only think of two or three cases where the ZK-stored cluster
> state was broken in a way that I had to manually intervene by hand-editing
> the contents of ZK. I think I've seen Solr fixes go by for those cases, too.
> I've never completely wiped ZK. (Although granted, my ZK cluster has been
> pretty stable, and my collection count is smaller than yours.)
>
> My philosophy is that ZK is the source of cluster configuration, not the
> collection of core.properties files on the nodes.
> Currently, cluster state is shared between ZK and core directories. I'd
> prefer, and I think Solr development is going this way (SOLR-7269), that
> all cluster state exist and be managed via ZK, and all state be removed
> from the local disk of the cluster nodes. The fact that a node uses local
> disk based configuration to figure out what collections/replicas it has is
> something that should be fixed, in my opinion.
>
> If you're frequently getting into bad states due to ZK issues, I'd suggest
> you file bugs against Solr for the fact that you got into that state, and
> then fix your ZK cluster.
>
> Failing that, can you just periodically back up your ZK data and restore
> it if something breaks? I wrote a little tool to watch clusterstate.json
> and write every version to a local git repo a few years ago. I was mostly
> interested because I wanted to see changes that happened pretty fast, but
> it could also serve as a backup approach. Here's a link, although I clearly
> haven't touched it lately. Feel free to ask if you have issues:
> https://github.com/randomstatistic/git_zk_monitor
>
>
> On 3/1/16, 12:09 PM, "danny teichthal" <dannyt...@gmail.com> wrote:
>
> >Hi,
> >Just summarizing my questions if the long mail is a little intimidating:
> >1. Is there a best practice/automated tool for overcoming problems in
> >cluster state coming from zookeeper disconnections?
> >2. Creating a collection via core admin is discouraged; is that true also
> >for core.properties discovery?
> >
> >I would like to be able to specify collection.configName in
> >core.properties, so that when the server starts, the collection will be
> >created and linked to the config name specified.
> >
> >
> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal <dannyt...@gmail.com>
> >wrote:
> >
> >> Hi,
> >>
> >> I would like to describe a process we use for overcoming problems in
> >> cluster state when we have networking issues. I would appreciate it if
> >> anyone can point out the flaws in this solution and what the best
> >> practice is for recovery in case of network problems involving zookeeper.
> >> I'm working with Solr Cloud version 5.2.1,
> >> ~100 collections in a cluster of 6 machines.
> >>
> >> This is the short procedure:
> >> 1. Bring all the cluster down.
> >> 2. Clear all data from zookeeper.
> >> 3. Upload configuration.
> >> 4. Restart the cluster.
> >>
> >> We rely on the fact that a collection is created during the core discovery
> >> process if it does not exist. It gives us much flexibility.
> >> When the cluster comes up, it reads core.properties and creates the
> >> collections if needed.
> >> Since we have only one configuration, the collections are automatically
> >> linked to it and the cores inherit it from the collection.
> >> This is a very robust procedure that helped us overcome many problems
> >> until we stabilized our cluster, which is now pretty stable.
> >> I know that the leader might change in such a case and we may lose
> >> updates, but that is ok.
> >>
> >> The problem is that today I want to add a new config set.
> >> When I add it and clear zookeeper, the cores cannot be created because
> >> there are 2 configurations. This breaks my recovery procedure.
> >>
> >> I thought about a few options:
> >> 1. Put the config name in core.properties - this doesn't work. (It is
> >> supported in CoreAdminHandler, but is discouraged according to the
> >> documentation.)
> >> 2. Change the recovery procedure to not delete all data from zookeeper,
> >> but only the relevant parts.
> >> 3. Change the recovery procedure to delete all, but recreate and link
> >> configurations for all collections before startup.
> >>
> >> Option #1 is my favorite, because it is very simple. It is currently not
> >> supported, but from looking at the code it looks like it is not complex
> >> to implement.
> >>
> >> My questions are:
> >> 1. Is there something wrong with the recovery procedure that I described?
> >> 2. What is the best way to fix problems in cluster state, apart from
> >> editing clusterstate.json manually? Is there an automated tool for that?
> >> We have about 100 collections in a cluster, so editing is not really a
> >> solution.
> >> 3. Is creating a collection via core.properties also discouraged?
> >>
> >> Would very much appreciate any answers/thoughts on that.
> >>
> >> Thanks,
> >>
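
As for the restore direction, this is roughly what I have in mind - a minimal sketch only. It assumes the backup checkout simply maps each znode path to a file whose contents are the znode data; I haven't checked your tool's actual layout, so the directory layout, the use of the kazoo client, and the filtering of ephemeral paths are all my assumptions, not necessarily how git_zk_monitor works:

import os
from kazoo.client import KazooClient

BACKUP_DIR = "/path/to/zk_backup_checkout"   # local git checkout of the backup (assumed layout)
ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"      # the ZK ensemble to restore into
SKIP_PREFIXES = ("/live_nodes", "/overseer_elect")  # ephemeral/election state, not worth restoring

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    for root, _dirs, files in os.walk(BACKUP_DIR):
        if ".git" in root.split(os.sep):
            continue  # skip git metadata
        for name in files:
            file_path = os.path.join(root, name)
            # Treat the file's relative path as the znode path, file contents as znode data.
            znode_path = "/" + os.path.relpath(file_path, BACKUP_DIR).replace(os.sep, "/")
            if znode_path.startswith(SKIP_PREFIXES):
                continue
            with open(file_path, "rb") as f:
                data = f.read()
            if zk.exists(znode_path):
                zk.set(znode_path, data)
            else:
                zk.create(znode_path, data, makepath=True)  # makepath creates missing parents
finally:
    zk.stop()

I would only run something like this against a stopped cluster, and let the nodes re-register themselves under /live_nodes when they start up again.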