Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

Jonathan Davies Thu, 19 Oct 2017 02:14:52 -0700


On 18/10/17 16:18, Jan Friesse wrote:

Jonathan,
On 18/10/17 14:38, Jan Friesse wrote:
Can you please try to remove
"votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
in the votequorum_exec_init_fn function (around line 2306) and let me
know if the problem persists?
Wow! With that change, I'm pleased to say that I'm not able to reproduce
the problem at all!
Sounds good.
Is this a legitimate fix, or do we still need the call to
votequorum_exec_send_nodeinfo for other reasons?
That is a good question. The call to votequorum_exec_send_nodeinfo should not be needed, because sync_process calls it only slightly later.
But before marking this as a legitimate fix, I would like to find out why this is happening and whether it is legal or not. Basically, because I'm not able to reproduce the bug at all (and I really tried, also with various usleeps/packet loss/...), I would like to have more information about notworking_cluster1.log. Because tracing doesn't work, we need to try the blackbox. Could you please add
icmap_set_string("runtime.blackbox.dump_flight_data", "yes");

line before api->shutdown_request(); in cmap.c?
It should trigger a blackbox dump into /var/lib/corosync. When you reproduce the notworking_cluster1 case, could you please either:
- compress the file pointed to by the /var/lib/corosync/fdata symlink
- or execute corosync-blackbox
- or execute qb-blackbox "/var/lib/corosync/fdata"

and send it?
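
For the first of the three options above, a minimal shell sketch of compressing whatever the fdata symlink currently points to might look like the following. The function name compress_fdata and the output-directory argument are illustrative assumptions, not part of corosync; only the /var/lib/corosync/fdata path comes from the message, and readlink -f assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch only: gzip the flight-data file that a symlink points to.
# compress_fdata and its second argument are hypothetical helpers.
compress_fdata() {
    link="${1:-/var/lib/corosync/fdata}"   # symlink named in the message
    out_dir="${2:-.}"                      # where to write the archive (assumption)

    # Resolve the symlink to the real dump file.
    target=$(readlink -f "$link") || return 1

    # Bail out if no dump has been written yet.
    [ -f "$target" ] || { echo "no flight data at $link" >&2; return 1; }

    # Keep the original file and write a .gz copy next to the chosen directory.
    gzip -c "$target" > "$out_dir/$(basename "$target").gz"
}
```

Running compress_fdata with no arguments on the affected node should leave a .gz copy of the dump in the current directory. The other two options decode the dump instead of shipping the raw file; their output can presumably be redirected to a file in the same way.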


Attached, along with the "debug: trace" log from cluster2.

Thanks,
Jonathan

fdata-2017-10-19T10:05:12-17515.gz
Description: application/gzip

notworking_cluster1.log.gz
Description: application/gzip

notworking_cluster2.log.gz
Description: application/gzip

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
