Thanks Jeff, I understand your philosophy and it sounds correct. Since we had many problems with zookeeper when switching to Solr Cloud, we couldn't treat it as the source of truth and had to rely on a more stable source. The issue is that when we got such a zookeeper event, it brought our system down, and in those cases, clearing the core.properties was a life saver. We've managed to make it pretty stable now, but we will always need a "doomsday" weapon.
I looked into the related JIRA and it confused me a little, and it raised a few other questions:
1. What exactly defines zookeeper as the source of truth?
2. What is the role of core.properties if the state is only in zookeeper?

Your tool is very interesting; I had just been thinking about writing such a tool myself. From the sources I understand that you represent each node as a path in the git repository. So I guess that for restore purposes I will have to go in the opposite direction and create a node for every path entry (a rough sketch of what I mean is at the bottom of this mail).

On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> I've been running SolrCloud clusters in various versions for a few years
> here, and I can only think of two or three cases where the ZK-stored cluster
> state was broken in a way that I had to manually intervene by hand-editing
> the contents of ZK. I think I've seen Solr fixes go by for those cases, too.
> I've never completely wiped ZK. (Although granted, my ZK cluster has been
> pretty stable, and my collection count is smaller than yours.)
>
> My philosophy is that ZK is the source of cluster configuration, not the
> collection of core.properties files on the nodes.
> Currently, cluster state is shared between ZK and core directories. I'd
> prefer, and I think Solr development is going this way (SOLR-7269), that
> all cluster state exist and be managed via ZK, and all state be removed
> from the local disk of the cluster nodes. The fact that a node uses local
> disk based configuration to figure out what collections/replicas it has is
> something that should be fixed, in my opinion.
>
> If you're frequently getting into bad states due to ZK issues, I'd suggest
> you file bugs against Solr for the fact that you got into that state, and
> then fix your ZK cluster.
>
> Failing that, can you just periodically back up your ZK data and restore
> it if something breaks? I wrote a little tool to watch clusterstate.json
> and write every version to a local git repo a few years ago. I was mostly
> interested because I wanted to see changes that happened pretty fast, but
> it could also serve as a backup approach. Here's a link, although I clearly
> haven't touched it lately. Feel free to ask if you have issues:
> https://github.com/randomstatistic/git_zk_monitor
>
>
> On 3/1/16, 12:09 PM, "danny teichthal" <dannyt...@gmail.com> wrote:
>
> >Hi,
> >Just summarizing my questions if the long mail is a little intimidating:
> >1. Is there a best practice/automated tool for overcoming problems in
> >cluster state coming from zookeeper disconnections?
> >2. Creating a collection via core admin is discouraged; is that true also
> >for core.properties discovery?
> >
> >I would like to be able to specify collection.configName in
> >core.properties, so that when the server starts, the collection will be
> >created and linked to the config name specified.
> >
> >
> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal <dannyt...@gmail.com>
> >wrote:
> >
> >> Hi,
> >>
> >> I would like to describe a process we use for overcoming problems in
> >> cluster state when we have networking issues. I would appreciate it if
> >> anyone can point out the flaws in this solution and what the best
> >> practice is for recovery in case of network problems involving zookeeper.
> >> I'm working with Solr Cloud version 5.2.1,
> >> ~100 collections in a cluster of 6 machines.
> >>
> >> This is the short procedure:
> >> 1. Bring all the cluster down.
> >> 2. Clear all data from zookeeper.
> >> 3. Upload configuration.
> >> 4. Restart the cluster.
> >>
> >> We rely on the fact that a collection is created during the core discovery
> >> process if it does not exist. It gives us much flexibility.
> >> When the cluster comes up, it reads core.properties and creates the
> >> collections if needed.
> >> Since we have only one configuration, the collections are automatically
> >> linked to it and the cores inherit it from the collection.
> >> This is a very robust procedure that helped us overcome many problems
> >> until we stabilized our cluster, which is now pretty stable.
> >> I know that the leader might change in such a case and we may lose
> >> updates, but that is ok.
> >>
> >> The problem is that today I want to add a new config set.
> >> When I add it and clear zookeeper, the cores cannot be created because
> >> there are 2 configurations. This breaks my recovery procedure.
> >>
> >> I thought about a few options:
> >> 1. Put the config name in core.properties - this doesn't work. (It is
> >> supported in CoreAdminHandler, but is discouraged according to the
> >> documentation.)
> >> 2. Change the recovery procedure to not delete all data from zookeeper,
> >> but only the relevant parts.
> >> 3. Change the recovery procedure to delete all, but recreate and link
> >> configurations for all collections before startup.
> >>
> >> Option #1 is my favorite, because it is very simple. It is currently not
> >> supported, but from looking at the code it looks like it is not complex
> >> to implement.
> >>
> >> My questions are:
> >> 1. Is there something wrong with the recovery procedure that I described?
> >> 2. What is the best way to fix problems in cluster state, apart from
> >> editing clusterstate.json manually? Is there an automated tool for that?
> >> We have about 100 collections in a cluster, so editing is not really a
> >> solution.
> >> 3. Is creating a collection via core.properties also discouraged?
> >>
> >> Would very much appreciate any answers/thoughts on that.
> >>
> >> Thanks,
> >>
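
As for the restore direction, this is roughly what I have in mind - a minimal sketch only. It assumes the backup checkout simply maps each znode path to a file whose contents are the znode data; I haven't checked your tool's actual layout, so the directory layout, the use of the kazoo client, and the filtering of ephemeral paths are all my assumptions, not necessarily how git_zk_monitor works:

import os
from kazoo.client import KazooClient

BACKUP_DIR = "/path/to/zk_backup_checkout"   # local git checkout of the backup (assumed layout)
ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"      # the ZK ensemble to restore into
SKIP_PREFIXES = ("/live_nodes", "/overseer_elect")  # ephemeral/election state, not worth restoring

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    for root, _dirs, files in os.walk(BACKUP_DIR):
        if ".git" in root.split(os.sep):
            continue  # skip git metadata
        for name in files:
            file_path = os.path.join(root, name)
            # Treat the file's relative path as the znode path, file contents as znode data.
            znode_path = "/" + os.path.relpath(file_path, BACKUP_DIR).replace(os.sep, "/")
            if znode_path.startswith(SKIP_PREFIXES):
                continue
            with open(file_path, "rb") as f:
                data = f.read()
            if zk.exists(znode_path):
                zk.set(znode_path, data)
            else:
                zk.create(znode_path, data, makepath=True)  # makepath creates missing parents
finally:
    zk.stop()

I would only run something like this against a stopped cluster, and let the nodes re-register themselves under /live_nodes when they start up again.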