On 2018-10-01 12:55 PM, Patrick Whitney wrote:
> > Fencing in clustering is always required, but unlike pacemaker that lets
> > you turn it off and take your chances, DLM doesn't.
>
> As a matter of fact, DLM has a setting "enable_fencing=0|1" for what
> that's worth.

I did not know that... Interesting. Dangerous, but interesting.

> > You must have
> > working fencing for DLM (and anything using it) to function correctly.
>
> We do have fencing enabled in the cluster; we've tested both node level
> fencing and resource fencing; DLM behaved identically in both scenarios,
> until we set it to 'enable_fencing=0' in the dlm.conf file.
>
> > Basically, cluster config changes (node declared lost), dlm informed and
> > blocks, fence attempt begins and loops until it succeeds, on success,
> > informs DLM, dlm reaps locks held by the lost node and normal operation
> > continues.
>
> This isn't quite what I was seeing in the logs. The "failed" node would
> be fenced off, pacemaker appeared to be sane, reporting services running
> on the running nodes, but once the failed node was seen as missing by
> dlm (dlm_controld), dlm would request fencing, from what I can tell by
> the log entry. Here is an example of the suspect log entry:
>
> Sep 26 09:41:35 pcmk-test-1 dlm_controld[837]: 38 fence request 2 pid 1446 startup time 1537969264 fence_all dlm_stonith
>
> > This isn't a question of node count or other configuration concerns.
> > It's simply that you must have proper fencing for DLM.
>
> Can you speak more to what "proper fencing" is for DLM?
>
> Best,
> -Pat
>
> On Mon, Oct 1, 2018 at 12:30 PM Digimer <[email protected]> wrote:
>
> > On 2018-10-01 12:04 PM, Ferenc Wágner wrote:
> > > Patrick Whitney <[email protected]> writes:
> > >
> > >> I have a two node (test) cluster running corosync/pacemaker with DLM
> > >> and CLVM.
> > >>
> > >> I was running into an issue where when one node failed, the remaining node
> > >> would appear to do the right thing, from the pcmk perspective, that is.
> > >> It would create a new cluster (of one) and fence the other node, but
> > >> then, rather surprisingly, DLM would see the other node offline, and it
> > >> would go offline itself, abandoning the lockspace.
> > >>
> > >> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
> > >> our tests are now working as expected.
> > >
> > > I'm running a larger Pacemaker cluster with standalone DLM + cLVM (that
> > > is, they are started by systemd, not by Pacemaker). I've seen weird DLM
> > > fencing behavior, but not what you describe above (though I ran with
> > > more than two nodes from the very start). Actually, I don't even
> > > understand how it occurred to you to disable DLM fencing to fix that...
> >
> > Fencing in clustering is always required, but unlike pacemaker that lets
> > you turn it off and take your chances, DLM doesn't. You must have
> > working fencing for DLM (and anything using it) to function correctly.
> >
> > Basically, cluster config changes (node declared lost), dlm informed and
> > blocks, fence attempt begins and loops until it succeeds, on success,
> > informs DLM, dlm reaps locks held by the lost node and normal operation
> > continues.
> >
> > This isn't a question of node count or other configuration concerns.
> > It's simply that you must have proper fencing for DLM.
> >
> > >> I'm a little concerned I have masked an issue by doing this, as in all
> > >> of the tutorials and docs I've read, there is no mention of having to
> > >> configure DLM whatsoever.
> > >
> > > Unfortunately it's very hard to come by any reliable info about DLM. I
> > > had a couple of enlightening exchanges with David Teigland (its primary
> > > author) on this list, he is very helpful indeed, but I'm still very far
> > > from having a working understanding of it.
> > >
> > > But I've been running with --enable_fencing=0 for years without issues,
> > > leaving all fencing to Pacemaker. Note that manual cLVM operations are
> > > the only users of DLM here, so delayed fencing does not cause any
> > > problems, the cluster services do not depend on DLM being operational (I
> > > mean it can stay frozen for several days -- as it happened in a couple
> > > of pathological cases).
> > > GFS2 would be a very different thing, I guess.
> >
> > --
> > Digimer
> > Papers and Projects: https://alteeve.com/w/
> > "I am, somehow, less interested in the weight and convolutions of
> > Einstein’s brain than in the near certainty that people of equal talent
> > have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>
> --
> Patrick Whitney
> DevOps Engineer -- Tools

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
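
The recovery sequence Digimer describes in this thread (membership change, DLM blocks, fencing retried until it succeeds, DLM informed, locks of the lost node reaped, normal operation resumes) can be sketched as a toy model. This only illustrates the ordering and is not how dlm_controld is implemented: ToyLockspace, node_lost, flaky_fence and max_attempts are all hypothetical names, and the retry cap exists only so the sketch terminates, whereas the thread describes fencing as looping until it succeeds.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyLockspace:
    """Toy stand-in for a DLM lockspace (hypothetical, not the real DLM API)."""
    locks: Dict[str, str] = field(default_factory=dict)  # resource -> owning node
    blocked: bool = False
    events: List[str] = field(default_factory=list)

    def node_lost(self, node: str, fence: Callable[[str], bool],
                  max_attempts: int = 10) -> bool:
        # 1. Membership change: block all lock activity.
        self.blocked = True
        self.events.append(f"blocked: {node} declared lost")

        # 2. Retry fencing. The cap is an assumption of this sketch only.
        for attempt in range(1, max_attempts + 1):
            if fence(node):
                self.events.append(f"fenced {node} on attempt {attempt}")
                break
        else:
            # Fencing never confirmed: stay blocked, reap nothing.
            self.events.append(f"still blocked: fencing {node} not confirmed")
            return False

        # 3. Fencing confirmed: reap the locks held by the lost node.
        reaped = [res for res, owner in self.locks.items() if owner == node]
        for res in reaped:
            del self.locks[res]
        self.events.append(f"reaped {len(reaped)} lock(s) from {node}")

        # 4. Resume normal operation.
        self.blocked = False
        return True


def flaky_fence(failures_before_success: int) -> Callable[[str], bool]:
    """Build a fake fence agent that fails a fixed number of times first."""
    remaining = [failures_before_success]

    def fence(node: str) -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return fence
```

For example, `ToyLockspace(locks={"vg0": "node2", "vg1": "node1"}).node_lost("node2", flaky_fence(2))` succeeds on the third fence attempt and leaves only node1's lock in place. With a fence callable that never returns True, the lockspace stays blocked and no locks are reaped, which is the point of the ordering: nothing held by the lost node is released until fencing is confirmed.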
