Hi, here is the log:
Attachment: corosync.log (binary data)
> On 22 Jun 2018, at 12:10, Christine Caulfield <[email protected]> wrote:
>
> On 22/06/18 10:39, Salvatore D'angelo wrote:
>> Hi,
>>
>> Can you tell me exactly which log you need? I’ll provide it as soon as
>> possible.
>>
>> Regarding some settings, I am not the original author of this cluster.
>> The people who created it left the company I am working with; I inherited
>> the code, and sometimes I do not know why some settings are used.
>> The old versions of pacemaker, corosync, crmsh and resource agents were
>> compiled and installed.
>> I simply downloaded the new versions, compiled and installed them. I didn’t
>> get any complaint during ./configure, which usually checks for library
>> compatibility.
>>
>> To be honest I do not know if this is the right approach. Should I “make
>> uninstall” the old versions before installing the new ones?
>> Which is the suggested approach?
>> Thanks in advance for your help.
>>
>
> OK fair enough!
>
> To be honest the best approach is almost always to get the latest
> packages from the distributor rather than compile from source. That way
> you can be more sure that upgrades will go more smoothly. Though, to be
> honest, I'm not sure how good the Ubuntu packages are (they might be
> great, they might not, I genuinely don't know)
>
> When building from source and if you don't know the provenance of the
> previous version then I would recommend a 'make uninstall' first - or
> removal of the packages if that's where they came from.
>
> One thing you should do is make sure that all the cluster nodes are
> running the same version. If some are running older versions then nodes
> could drop out for obscure reasons. We try and keep minor versions
> on-wire compatible but it's always best to be cautious.
>
> The tidying of your corosync.conf can wait for the moment, let's get
> things mostly working first.
> If you enable debug logging in corosync.conf:
>
> logging {
>     to_syslog: yes
>     debug: on
> }
>
> Then see what happens and post the syslog file that has all of the
> corosync messages in it, we'll take it from there.
>
> Chrissie
>
>>> On 22 Jun 2018, at 11:30, Christine Caulfield <[email protected]> wrote:
>>>
>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>> Hi Christine,
>>>>
>>>> Thanks for the reply. Let me add a few details. When I run the corosync
>>>> service I see the corosync process running. If I stop it and run:
>>>>
>>>> corosync -f
>>>>
>>>> I see the following warnings:
>>>> warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
>>>> warning [MAIN ] Please migrate config file to nodelist.
>>>> warning [MAIN ] Could not set SCHED_RR at priority 99: Operation not permitted (1)
>>>> warning [MAIN ] Could not set priority -2147483648: Permission denied (13)
>>>>
>>>> but I see the node joined.
>>>>
>>>
>>> Those certainly need fixing but are probably not the cause. Also why do
>>> you have these values below set?
>>>
>>> max_network_delay: 100
>>> retransmits_before_loss_const: 25
>>> window_size: 150
>>>
>>> I'm not saying they are causing the trouble, but they aren't going to
>>> help keep a stable cluster.
>>>
>>> Without more logs (full logs are always better than just the bits you
>>> think are meaningful) I still can't be sure. It could easily be just
>>> that you've overwritten a packaged version of corosync with your own
>>> compiled one and they have different configure options or that the
>>> libraries now don't match.
>>>
>>> Chrissie
>>>
>>>
>>>> My corosync.conf file is below.
>>>>
>>>> With service corosync up and running I have the following output:
>>>> corosync-cfgtool -s
>>>> Printing ring status.
>>>> Local node ID 1
>>>> RING ID 0
>>>>     id= 10.0.0.11
>>>>     status= ring 0 active with no faults
>>>> RING ID 1
>>>>     id= 192.168.0.11
>>>>     status= ring 1 active with no faults
>>>>
>>>> corosync-cmapctl | grep members
>>>> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
>>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
>>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
>>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
>>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>>>
>>>> For the moment I have two nodes in my cluster (the third node had some
>>>> issues, so for the moment I ran crm node standby on it).
>>>>
>>>> Here are the dependencies I have installed for corosync (they work fine with
>>>> pacemaker 1.1.14 and corosync 2.3.5):
>>>> libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>> libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>> libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>> libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>> libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>> libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>> libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>>
>>>> corosync.conf
>>>> ---------------------
>>>> quorum {
>>>>     provider: corosync_votequorum
>>>>     expected_votes: 3
>>>> }
>>>> totem {
>>>>     version: 2
>>>>     crypto_cipher: none
>>>>     crypto_hash: none
>>>>     rrp_mode: passive
>>>>     interface {
>>>>         ringnumber: 0
>>>>         bindnetaddr: 10.0.0.0
>>>>         mcastport: 5405
>>>>         ttl: 1
>>>>     }
>>>>     interface {
>>>>         ringnumber: 1
>>>>         bindnetaddr: 192.168.0.0
>>>>         mcastport: 5405
>>>>         ttl: 1
>>>>     }
>>>>     transport: udpu
>>>>     max_network_delay: 100
>>>>     retransmits_before_loss_const: 25
>>>>     window_size: 150
>>>> }
>>>> nodelist {
>>>>     node {
>>>>         ring0_addr: pg1
>>>>         ring1_addr: pg1p
>>>>         nodeid: 1
>>>>     }
>>>>     node {
>>>>         ring0_addr: pg2
>>>>         ring1_addr: pg2p
>>>>         nodeid: 2
>>>>     }
>>>>     node {
>>>>         ring0_addr: pg3
>>>>         ring1_addr: pg3p
>>>>         nodeid: 3
>>>>     }
>>>> }
>>>> logging {
>>>>     to_syslog: yes
>>>> }
>>>>
>>>>
>>>>> On 22 Jun 2018, at 09:24, Christine Caulfield <[email protected]> wrote:
>>>>>
>>>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>>>> Pacemaker 1.1.14 -> 1.1.18
>>>>>> Corosync 2.3.5 -> 2.4.4
>>>>>> Crmsh 2.2.0 -> 3.0.1
>>>>>> Resource agents 3.9.7 -> 4.1.1
>>>>>>
>>>>>> I started with a first node (I am trying a one-node-at-a-time upgrade).
>>>>>> On a PostgreSQL slave node I did:
>>>>>>
>>>>>> crm node standby <node>
>>>>>> service pacemaker stop
>>>>>> service corosync stop
>>>>>>
>>>>>> Then I built the tools above as described on their GitHub.com pages.
>>>>>>
>>>>>> ./autogen.sh (where required)
>>>>>> ./configure
>>>>>> make (where required)
>>>>>> make install
>>>>>>
>>>>>> Everything went OK. I expected the new files to overwrite the old ones.
>>>>>> I left the dependencies I had with the old software in place because I
>>>>>> noticed that ./configure didn’t complain.
>>>>>> I started corosync.
>>>>>>
>>>>>> service corosync start
>>>>>>
>>>>>> To verify corosync works properly I used the following commands:
>>>>>> corosync-cfgtool -s
>>>>>> corosync-cmapctl | grep members
>>>>>>
>>>>>> Everything seemed OK and I verified my node joined the cluster (at least
>>>>>> this is my impression).
>>>>>>
>>>>>> Here I ran into a problem. Running the command:
>>>>>> corosync-quorumtool -ps
>>>>>>
>>>>>> I got the following error:
>>>>>> Cannot initialise CFG service
>>>>>>
>>>>> That says that corosync is not running.
>>>>> Have a look in the log files to
>>>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>>>> but we can't make any more guesses until we see what corosync itself is
>>>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>>>> needed.
>>>>>
>>>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>>>> a quick way of checking things are starting OK.
>>>>>
>>>>> Chrissie
>>>>>
>>>>>
>>>>>> If I try to start pacemaker, I only see pacemaker process running and
>>>>>> pacemaker.log containing the following lines:
>>>>>>
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: mcp_read_config: Reading configure for stack: corosync
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: notice: main: Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios corosync-native atomic-attrd acls
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: main: Maximum core file size is: 18446744073709551615
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: qb_ipcs_us_publish: server name: pacemakerd
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: warning: corosync_node_name: Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 1
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Node 1 has uuid 1
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: error: cluster_connect_quorum: Could not connect to the Quorum API: 2
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: main: Exiting pacemakerd
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>>>
>>>>>> *What is wrong in my procedure?*
>>>>>>
>>>>>>
>>>>>>
> _______________________________________________
> Users mailing list: [email protected]
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
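The membership check quoted in this thread (corosync-cmapctl | grep members) could be scripted roughly as below. This is only a sketch: the sample_output function is a hypothetical stand-in that replays the output posted in the thread, so the script runs without a cluster. On a live node it would presumably be replaced by the real command. Note that it only counts members reporting "joined"; it says nothing about quorum, which is where the reported failure occurred.

```shell
#!/bin/sh
# Sketch: count cluster members whose status is "joined".
# sample_output is a hypothetical stand-in that replays the output quoted
# in the thread; on a live node use: corosync-cmapctl | grep members
sample_output() {
cat <<'EOF'
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
EOF
}

# Count only the per-member status lines that end in "joined".
joined_count=$(sample_output | grep -c 'members\.[0-9]*\.status (str) = joined$')
echo "joined members: $joined_count"
```

With the output quoted above, two members are counted as joined, while corosync.conf sets expected_votes to 3 because the third node is in standby.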
