13.01.2014, 02:51, "Andrew Beekhof" <[email protected]>:
> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <[email protected]> wrote:
>
>> 10.01.2014, 14:31, "Andrey Groshev" <[email protected]>:
>>> 10.01.2014, 14:01, "Andrew Beekhof" <[email protected]>:
>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <[email protected]> wrote:
>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <[email protected]>:
>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <[email protected]> wrote:
>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <[email protected]>:
>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <[email protected]>
>>>>>>>> wrote:
>>>>>>>>> Hi, ALL.
>>>>>>>>>
>>>>>>>>> I'm still trying to deal with the fact that after a fence, the
>>>>>>>>> node hangs in "pending".
>>>>>>>> Please define "pending". Where did you see this?
>>>>>>> In crm_mon:
>>>>>>> ......
>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>> ......
>>>>>>>
>>>>>>> The experiment was like this:
>>>>>>> Four nodes in cluster.
>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>> After that, the remaining nodes constantly reboot it under
>>>>>>> various pretexts: "softly whistling", "fly low", "not a cluster
>>>>>>> member!" ...
>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>> All this time the status in crm_mon is "pending".
>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>> Much time has passed and I cannot accurately describe the
>>>>>>> behavior...
>>>>>>>
>>>>>>> Now I am in the following state:
>>>>>>> I tried to locate the problem and came up with this:
>>>>>>> I set a big value in the property stonith-timeout="600s"
>>>>>>> and got the following behavior:
>>>>>>> 1. pkill -4 corosync
>>>>>>> 2. The DC node calls my fence agent "sshbykey".
>>>>>>> 3. It reboots the victim and waits until it comes back to life.
>>>>>> Hmmm.... what version of pacemaker?
>>>>>> This sounds like a timing issue that we fixed a while back
>>>>> It was version 1.1.11 from December 3.
>>>>> Now I'll do a full update and retest.
>>>> That should be recent enough. Can you create a crm_report the next time
>>>> you reproduce?
>>> Of course yes. Little delay.... :)
>>>
>>> ......
>>> cc1: warnings being treated as errors
>>> upstart.c: In function ‘upstart_job_property’:
>>> upstart.c:264: error: implicit declaration of function
>>> ‘g_variant_lookup_value’
>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>> make: *** [core] Error 1
>>>
>>> I'm trying to solve this problem.
>> It won't get solved quickly...
>>
>>
>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>> g_variant_lookup_value () Since 2.28
>>
>> # yum list installed glib2
>> Loaded plugins: fastestmirror, rhnplugin, security
>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>> Loading mirror speeds from cached hostfile
>> Installed Packages
>> glib2.x86_64
>> 2.26.1-3.el6
>> installed
>>
>> # cat /etc/issue
>> CentOS release 6.5 (Final)
>> Kernel \r on an \m
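The yum output above explains the build failure: CentOS 6.5 ships glib2 2.26.1, while g_variant_lookup_value() only exists since glib 2.28. A quick way to check this gap from the shell (a generic version-comparison sketch using sort -V; the version strings are taken from the output above, not queried live):

```shell
# Compare the installed glib2 version against the 2.28 minimum that
# g_variant_lookup_value() requires.  Illustrative helper only, not
# part of the pacemaker build system.
installed="2.26.1"   # e.g. from: rpm -q --qf '%{VERSION}\n' glib2
required="2.28.0"

# sort -V orders version strings numerically; if the installed version
# sorts first (and differs), it is older than the required one.
oldest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$required" ]; then
    echo "glib2 $installed is older than $required: g_variant_lookup_value unavailable"
fi
```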
>
> Can you try this patch?
> Upstart jobs won't work, but the code will compile.
>
> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
> index 831e7cf..195c3a4 100644
> --- a/lib/services/upstart.c
> +++ b/lib/services/upstart.c
> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
> static char *
> upstart_job_property(const char *obj, const gchar * iface, const char *name)
> {
> + char *output = NULL;
> +
> +#if !GLIB_CHECK_VERSION(2,28,0)
> + static bool err = TRUE;
> +
> + if(err) {
> + crm_err("This version of glib is too old to support upstart jobs");
> + err = FALSE;
> + }
> +#else
> GError *error = NULL;
> GDBusProxy *proxy;
> GVariant *asv = NULL;
> GVariant *value = NULL;
> GVariant *_ret = NULL;
> - char *output = NULL;
>
> crm_info("Calling GetAll on %s", obj);
> proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar *
> iface, const char *name)
>
> g_object_unref(proxy);
> g_variant_unref(_ret);
> +#endif
> return output;
> }
>
Ok :) I patched the source.
Typed "make rc" - the same error.
Made a fresh copy via "fetch" - the same error.
It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not
exist, it is rebuilt; otherwise the existing archive is reused.
Cut log .......
# make rc
make TAG=Pacemaker-1.1.11-rc3 rpm
make[1]: Entering directory `/root/ha/pacemaker'
rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
    rm -f pacemaker.tar.*; \
    if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
        git commit -m "DO-NOT-PUSH" -a; \
        git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
        git reset --mixed HEAD^; \
    else \
        git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
    fi; \
    echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
else \
    echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
fi
Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
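The log shows why the patch had no effect: the make rule only re-archives the tree when the tarball is absent, so a tarball created before patching keeps being rebuilt. A minimal reproduction of that caching logic, plus the presumed workaround (deleting the stale tarball so the next `make rc` re-runs `git archive`), using the same filename from the log:

```shell
# Reproduce the Makefile's caching check: the tarball is only rebuilt
# when the file does not already exist.
tarball=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
cd "$(mktemp -d)"

touch "$tarball"            # simulate a stale, pre-patch archive
if [ ! -f "$tarball" ]; then
    action="rebuilt"
else
    action="reused"         # this branch fires: the patch is never packaged
fi

rm -f "$tarball"            # the workaround: drop the stale tarball
if [ ! -f "$tarball" ]; then
    action2="rebuilt"       # now `make rc` would re-run `git archive`
else
    action2="reused"
fi
```

So running `rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz` before `make rc` should get the patched sources into the rpms.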
.......
Well, "make rpm" - build rpms and I create cluster.
I spent the same tests and confirmed the behavior.
crm_reoprt log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>> Once the script makes sure that the victim has rebooted and is
>>>>>>> again available via ssh, it exits with 0.
>>>>>>> All commands are logged on both the victim and the killer - all right.
>>>>>>> 4. A little later, the status of the victim node in crm_mon
>>>>>>> changes to online.
>>>>>>> 5. BUT... not one resource starts! Despite the fact that
>>>>>>> "crm_simulate -sL" shows the correct resource to start:
>>>>>>> * Start pingCheck:3 (dev-cluster2-node2)
>>>>>>> 6. In this state we spend the next 600 seconds.
>>>>>>> After this timeout expires, another node (not the DC)
>>>>>>> decides to kill our victim again.
>>>>>>> All commands are again logged on both the victim and the killer -
>>>>>>> all documented :)
>>>>>>> 7. NOW all resources start in the right sequence.
>>>>>>>
>>>>>>> I'm almost happy, but I don't like it: two reboots and 10 minutes
>>>>>>> of waiting ;)
>>>>>>> And if something happens on another node, this behavior is
>>>>>>> superimposed on the old one, and no resources start until the last
>>>>>>> node has been rebooted twice.
>>>>>>>
>>>>>>> I tried to understand this behavior.
>>>>>>> As I understand it:
>>>>>>> 1. Ultimately, ./lib/fencing/st_client.c calls
>>>>>>> internal_stonith_action_execute().
>>>>>>> 2. It forks and creates a pipe to the child.
>>>>>>> 3. In the async case it calls mainloop_child_add with a callback
>>>>>>> to stonith_action_async_done.
>>>>>>> 4. It adds timeouts via g_timeout_add for sending the TERM and
>>>>>>> KILL signals.
>>>>>>>
>>>>>>> If all goes right, stonith_action_async_done is called and the
>>>>>>> timeout is removed.
>>>>>>> For some reason this does not happen. I sit and think ....
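The TERM-then-KILL escalation that g_timeout_add arms in step 4 has the same shape as the coreutils timeout(1) command (an analogy only; pacemaker implements this inside its own GLib mainloop, not via timeout):

```shell
# timeout(1) mirrors the two-stage escalation: send SIGTERM when the
# first deadline (1s) expires, then SIGKILL after a further grace
# period (-k 2).  A 5-second sleep stands in for a fence agent that
# never reports completion.
timeout -k 2 1 sleep 5
status=$?   # timeout(1) exits 124 when it had to terminate the command
echo "exit status: $status"
```

In pacemaker's case the puzzle is the opposite direction: the agent *does* complete, yet stonith_action_async_done is apparently not invoked, so the escalation timer fires anyway.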
>>>>>>>>> At this time there are constant re-elections.
>>>>>>>>> Also, I noticed the difference when you start pacemaker.
>>>>>>>>> At normal startup:
>>>>>>>>> * corosync
>>>>>>>>> * pacemakerd
>>>>>>>>> * attrd
>>>>>>>>> * pengine
>>>>>>>>> * lrmd
>>>>>>>>> * crmd
>>>>>>>>> * cib
>>>>>>>>>
>>>>>>>>> When hangs start:
>>>>>>>>> * corosync
>>>>>>>>> * pacemakerd
>>>>>>>>> * attrd
>>>>>>>>> * pengine
>>>>>>>>> * crmd
>>>>>>>>> * lrmd
>>>>>>>>> * cib.
>>>>>>>> Are you referring to the order of the daemons here?
>>>>>>>> The cib should not be at the bottom in either case.
>>>>>>>>> Who knows who runs lrmd?
>>>>>>>> Pacemakerd.
>>>>>>>>> _______________________________________________
>>>>>>>>> Pacemaker mailing list: [email protected]
>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>
>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>> Getting started:
>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>> Bugs: http://bugs.clusterlabs.org