On 05/26/2016 03:17 PM, Vladislav Bogdanov wrote:
> Hi all,
>
> here is a list of issues found during testing of a setup with 2 cluster
> nodes, 8 remote nodes and around 450 resources. I hope it could be
> useful for doing some polishing before the 1.1.15 release. The
> pacemaker version tested is quite close to 1.1.15-rc1.

Thanks, this is useful.

> * templates are not supported for ocf:pacemaker:remote
> * fencing events may be lost due to long transition run time (already
>   discussed)
> * cib becomes unresponsive when uploading many changes, which leads to
>   sbd fencing (if sbd is enabled)
> * node-action-limit seems to work on a per-cluster-node basis, so it
>   limits the number of operations run on all remote nodes connected
>   through a given cluster node
> * changing many node attributes during a transition run may lead to a
>   transition-recalculation storm (found with a resource agent that
>   changes dozens of attributes)
> * notice: Relying on watchdog integration for fencing - this probably
>   needs to be reworded/downgraded

FYI, there was a regression introduced in 1.1.14 that resulted in
have-watchdog always being true (and the above message being printed)
regardless of whether sbd was actually running. That has been fixed, and
the fix will be in 1.1.15rc3 (which I intend to release tomorrow).
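
Regarding the templates item above: for anyone wanting to reproduce it,
the configuration would presumably look something like the sketch below.
This is how templates work for ordinary primitives; all IDs and the
reconnect_interval value are made up for illustration:

  <!-- template meant to carry shared remote-connection settings -->
  <template id="remote-defaults" class="ocf" provider="pacemaker"
            type="remote">
    <instance_attributes id="remote-defaults-attrs">
      <nvpair id="remote-defaults-reconnect" name="reconnect_interval"
              value="60s"/>
    </instance_attributes>
  </template>
  <!-- a remote connection primitive referencing the template -->
  <primitive id="remote1" template="remote-defaults"/>

The report is that this template reference is not honored when the
agent is ocf:pacemaker:remote.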
> * application of a big enough CIB diff results in monitor failures -
>   CPU hog? CIB hang?
> * crmd[9834]: crit: GLib: g_hash_table_lookup: assertion 'hash_table
>   != NULL' failed - hope to catch this again next week, as the
>   coredump is lost
> * pacemaker loses a resource's exit from a pending state
>   (Starting/Stopping/Migrating) - the change is visible in the logs of
>   the local node (or of the node whose crmd manages a given remote
>   node) but is not propagated to the CIB
> * crmd crash discovered after moving the DC node to standby - segfault
>   in crmd's remote-related code (lrmd client) - hope to catch this
>   again next week
> * failcounts for resources on remote nodes are not properly cleaned up
>   (related to pending states being enabled?)
> * many "warning: No reason to expect node XXX to be down" when
>   deleting attributes on remote nodes
> * "error: Query resulted in an error: Timer expired" when adding
>   attributes on remote nodes
> * the same when uploading a CIB patch
> * attrd[23798]: notice: Update error (unknown peer uuid, retry will be
>   attempted once uuid is discovered): <node>[<attribute>]=(null)
>   failed (host=0x2921ae0) - needs to be reinvestigated

The above could be related to a bug introduced after 1.1.14, having to
do with reusing node IDs when removing/adding nodes. It is now fixed,
and the fix will be in 1.1.15rc3.
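
On the failcount item above: for reference, this is the sort of cleanup
that appears to misbehave on remote nodes; a minimal sketch, with
hypothetical resource and node names (dummy1, remote1):

  # clear dummy1's failure history on the remote node
  crm_resource --cleanup --resource dummy1 --node remote1

  # the transient fail-count attribute should be gone afterwards
  crm_attribute --node remote1 --lifetime reboot \
                --name fail-count-dummy1 --query

If the query still returns a value after the cleanup, that would match
what is described above.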
> If there is any interest in additional information, I can gather it
> next week when I have access to the hardware again.
>
> Hope this is useful,
>
> Vladislav

_______________________________________________
Developers mailing list
[email protected]
http://clusterlabs.org/mailman/listinfo/developers