An orderly close, or even a client disconnection that doesn't kill the 
process, shouldn't be a "FAIL".  The only thing that increments the 
FAIL counter is an actual service crash or an explicit kill from outside the 
Cyrus ecosystem.  Even a user_kill using SIGTERM doesn't get logged as a FAIL, 
I believe.  At least, I would have expected to notice it, since Fastmail 
sends an external SIGTERM to every logged-in process when a user's password 
is marked stolen.  We use this code:

my @Procs = $Slot->RunCommand('cyr_info', 'proc');

my $Types = join "|", map { quotemeta } @Types;
my $TypesMatch = qr/$Types/;

my @Pids;
for (@Procs) {
  # 7686 httpjmap/jmap fastmail1.internal [10.37.129.192] testuser_25763_1_1685935...@fastmaildev.com /jmap/ws/ WS
  my ($Pid, $Proc, $Src, $Ip, $User, $Folder) = split / /;
  next unless $User;
  # Non-httpjmap http processes change logged-in user too quickly, since they
  #  only ever handle a single short-lived request and each request is
  #  authenticated as a different user.  That makes them very racy to try to
  #  kill, so just ignore them.
  next if $Proc =~ /^http/ && $Proc !~ /^httpjmap/;
  push @Pids, $Pid if $User eq $CyrusName && (!@Types || $Proc =~ $TypesMatch);
}

my $PidCount = scalar @Pids;
if ($PidCount && !-f "/tmp/nokillconnections") {
  kill 'TERM', @Pids;
}

print "OK $PidCount\n";

So I don't believe your case should happen frequently, unless you have 
something that's actually crashing daemons; but in that case you have much 
bigger problems, since you'll be spending a lot of time in crash recovery code!

I'd hope that's not common.

Bron.

On Tue, Oct 1, 2024, at 06:08, ego...@sarenet.es wrote:
> Hi Bron!
> 
> 
> 
> First of all, don't worry about the delay!  I absolutely understand you are 
> extremely busy, and the nice job Fastmail does with Cyrus is very much 
> appreciated.  By the way, I apologise too for my delay in answering this email.
> 
> From what I have seen, this is implemented only in master, not in other tags 
> or branches.
> 
> I honestly think that if a process receives a TERM signal that was not sent 
> by the master (which I saw can be checked), perhaps it should get special 
> handling: it's not a failure, it's a requested process stop, which should be 
> done orderly (as it currently is).
> 
> Imagine that one morning (due to a malware outbreak) you have 4 or 5 user 
> disconnections, with several processes constantly being spawned.  There is a 
> high probability of more than MAX_READY_FAILS processes failing in less than 
> MAX_READY_FAIL_INTERVAL.  With the approach you describe, mainly on busy 
> machines, you would respond very slowly during that MAX_READY_FAIL_INTERVAL 
> (10 seconds) to several requests, which could end up causing some really 
> unnecessary load on the machine.
> 
> Don't you think that another way of handling it could be to reset the 
> nreadyfails counter in the service data structure when:
> 
>  • As now: (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL), or
>  • Also when (now - last_received_sigurg_timestamp < 
> MAX_KEEP_ZERO_NREADYFAILS_INTERVAL), where MAX_KEEP_ZERO_NREADYFAILS_INTERVAL 
> could be defined as 60 seconds, for instance?  And, as now, only reset the 
> counter when needed in that service?
> 
> I suggest this because, without something along these lines, how could 
> processes be shut down in an orderly way for users who must be disconnected 
> urgently because unauthorized accesses with their username have been 
> detected?
> 
> 
> 
> Cheers Bron!
> 
> 
> 
> On 2024-09-26 15:02, Bron Gondwana wrote:
> 
>> Hi!  Sorry nobody got back to you earlier on this.
>>  
>> In an example of synchronicity, I recently wrote something touching the same 
>> area:
>>  
>> https://github.com/cyrusimap/cyrus-imapd/pull/5036
>>  
>> I solved this a different way: if the "babysit" option is set (it's 
>> automatic for DAEMON blocks and can be added to each SERVICE), then we only 
>> wait one MAX_READY_FAIL_INTERVAL before we start spawning processes again.
>>  
>> However... your fix also seems sound for the non-babysit case.  Do you want 
>> to create it as a Pull Request against 
>> https://github.com/cyrusimap/cyrus-imapd master?  That's the easiest for us 
>> to work with, and ensures you get proper attribution!
>> 
>> Cheers,
>>  
>> Bron.
>>  
>> On Tue, Sep 10, 2024, at 12:55, ego...@sarenet.es wrote:
>>> Hi!,
>>> 
>>> 
>>> 
>>> Some time ago, I wrote that when we send TERM to imapd processes we want 
>>> to exit (although I assume it would happen with any other service: pop, 
>>> sieve...), because a user requested that his/her sessions be disconnected, 
>>> it sometimes happened that after that session disconnection (TERM to the 
>>> imapd processes) not enough new processes were spawned.  It only went well 
>>> sometimes, when very few processes needed to be killed.
>>> 
>>> I have been able to reproduce it.  If a user has connected (we know 
>>> because proctitle() has set it in the process name) and "leaves" very 
>>> shortly afterwards (logs out, for instance), the process moves back to the 
>>> READY state.  If you then kill with TERM more than MAX_READY_FAILS 
>>> instances of that process in less than MAX_READY_FAIL_INTERVAL, master 
>>> won't spawn new processes, as written in master.c near line 1100 in the 
>>> reap_child() function.
>>> 
>>> It's suggested to send a SIGHUP to master to activate the service again, 
>>> but it can't be enabled, because the service seems to have been removed 
>>> from the s data structure without being stopped.  Because the process was 
>>> not stopped, when a new imapd attempts to start (in service_create()) it 
>>> can't be created, because the socket is still in use.
>>> 
>>> So, to confirm this is correct, I have written the following patch for 
>>> master.c, which I tested on 3.0.15:
>>> 
>>> 
>>> 
>>> root@debugcyrus:/usr/ports/mail/cyrus-imapd30 # diff -u work/cyrus-imapd-3.0.15/master/master.c /master.c-definitivo
>>> --- work/cyrus-imapd-3.0.15/master/master.c    2021-03-09 04:27:45.000000000 +0100
>>> +++ /master.c-definitivo    2024-09-10 18:36:49.797581000 +0200
>>> @@ -129,6 +129,11 @@
>>>  };
>>>  
>>>  static int verbose = 0;
>>> +
>>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
>>> +static int gotsigurg = 0;
>>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
>>> +
>>>  static int listen_queue_backlog = 32;
>>>  static int pidfd = -1;
>>>  
>>> @@ -1047,6 +1052,22 @@
>>>      }
>>>  }
>>>  
>>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
>>> +static void sigurg_handler(int sig __attribute__((unused)))
>>> +{
>>> +    syslog(LOG_DEBUG, "SIGURG captured!");
>>> +
>>> +    if (gotsigurg)
>>> +    {
>>> +        gotsigurg = 0;
>>> +    }
>>> +    else
>>> +    {
>>> +        gotsigurg = 1;
>>> +    }
>>> +}
>>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
>>> +
>>>  static void reap_child(void)
>>>  {
>>>      int status;
>>> @@ -1094,10 +1115,24 @@
>>>                                 "terminated abnormally",
>>>                                 SERVICEPARAM(s->name),
>>>                                 SERVICEPARAM(s->familyname), pid);
>>> -                        if (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL) {
>>> +
>>> +                        syslog(LOG_DEBUG, "SIGURG flag is --%d--", gotsigurg);
>>> +
>>> +                        if ((now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL) || (gotsigurg))
>>> +                        {
>>>                              s->nreadyfails = 0;
>>> +
>>> +                            if (gotsigurg)
>>> +                            {
>>> +                                syslog(LOG_DEBUG, "Resetting....");
>>> +                                syslog(LOG_DEBUG, "Reset done....");
>>> +                            }
>>>                          }
>>> +
>>> +                        syslog(LOG_DEBUG, "too many failures for service %s/%s, resetting counters due to SIGURG received in Cyrus master; gotsigurg is --%d--", SERVICEPARAM(s->name), SERVICEPARAM(s->familyname), gotsigurg);
>>> +
>>>                          s->lastreadyfail = now;
>>> +
>>>                          if (++s->nreadyfails >= MAX_READY_FAILS && s->exec) {
>>>                              syslog(LOG_ERR, "too many failures for "
>>>                                     "service %s/%s, disabling until next SIGHUP",
>>> @@ -1305,11 +1340,18 @@
>>>      sigemptyset(&action.sa_mask);
>>>  
>>>      action.sa_handler = sighup_handler;
>>> +
>>>  #ifdef SA_RESTART
>>>      action.sa_flags |= SA_RESTART;
>>>  #endif
>>>      if (sigaction(SIGHUP, &action, NULL) < 0)
>>>          fatalf(1, "unable to install signal handler for SIGHUP: %m");
>>> +
>>> +    /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
>>> +    action.sa_handler = sigurg_handler;
>>> +    if (sigaction(SIGURG, &action, NULL) < 0)
>>> +        fatalf(1, "unable to install signal handler for SIGURG: %m");
>>> +    /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
>>>  
>>>      action.sa_handler = sigalrm_handler;
>>>      if (sigaction(SIGALRM, &action, NULL) < 0)
>>> root@debugcyrus:/usr/ports/mail/cyrus-imapd30 # 
>>> 
>>> 
>>> 
>>> 
>>> From what I have seen of how the Cyrus master is written (and the way it 
>>> handles services), I think the possible solutions could be:
>>> 
>>>  • Send kill(pid, SIGTERM) to ensure the process dies before forgetting it 
>>> from the s structure.
>>>  • Do something similar to what I have done, which gives you a time window 
>>> allowing more failures than expected for some seconds, and which could 
>>> later be undone, for instance by sending the same signal again.
>>> 
>>> 
>>> Reproduced, and something proposed, at least.
>>> 
>>> 
>>> 
>>> What do you think about it? :)
>>> 
>>> 
>>> 
>>> Cheers!
>>> 
>>> 
>>> 
>>> 
>>> 
>>>  
>>>  
>>>  
>>> *Egoitz Aurrekoetxea*
>>> Systems department
>>> 94 - 420 94 70
>>> ego...@sarenet.es
>>> www.sarenet.es
>>> Parque Tecnológico. Edificio 103
>>> 48170 Zamudio (Bizkaia)
>>> Before printing this email, please consider whether it is really necessary.
>>> 
>>> 
>>  
>> -- 
>>   Bron Gondwana
>>   br...@fastmail.fm
>>  
>>  
>> *Cyrus <https://cyrus.topicbox.com/latest>* / Devel / see discussions 
>> <https://cyrus.topicbox.com/groups/devel> + participants 
>> <https://cyrus.topicbox.com/groups/devel/members> + delivery options 
>> <https://cyrus.topicbox.com/groups/devel/subscription> Permalink 
>> <https://cyrus.topicbox.com/groups/devel/Tf9f7cf579fff1397-M0cfe9d73756070115586cdc5>
> 
> 

-- 
  Bron Gondwana
  br...@fastmail.fm


------------------------------------------
Cyrus: Devel
Permalink: 
https://cyrus.topicbox.com/groups/devel/Tf9f7cf579fff1397-M1c1ea091515d2c811550d096
Delivery options: https://cyrus.topicbox.com/groups/devel/subscription
