Re: Services in READY state and MAX_READY_FAILS in less than MAX_READY_FAIL_INTERVAL

egoitz via Devel Tue, 01 Oct 2024 08:40:44 -0700

Hi Bron,

The problem happens after a process has served a connection and has later gone idle, in the READY state. If you send a TERM to that process while it is in the READY state (even though the user has already logged out), I have seen that the service's s->nreadyfails gets incremented. I could show you with a recorded video or similar. If you do exactly that more than MAX_READY_FAILS times within MAX_READY_FAIL_INTERVAL seconds, you hit this issue.

As mentioned, to reproduce it you have to disconnect from the account (issue an IMAP logout) and leave the process in the READY state.

I don't really know how $Slot->RunCommand('cyr_info', 'proc') fills the @Procs array (with which pids), but I get the pids with a shell script and then call kill on them. Concretely, we end up running:

ps auxwwww | perl -ne "print if /^$USER\s+/" | egrep 'imapd|pop3d:imap|pop3: proxy' | egrep -i "PROXY1] $1@$2|PROXY2] $1@$2" | awk '{print $2}' | tr -d ' ' | xargs -I [] sh -c "echo Parando proceso --[]-- && kill -TERM []"

That's not part of the Cyrus ecosystem, and perhaps we are not doing it properly. That's why I wanted to know. Perhaps the cyr_info command does not return processes whose user has logged out, and that's why you don't see this effect?


Could that be it?

Cheers!

On 2024-10-01 15:36, Bron Gondwana wrote:

An orderly close, or even a client disconnection which doesn't kill the process, shouldn't be a "FAIL". The only thing that's going to increment the FAIL counter is an actual service crash or an explicit kill from outside the Cyrus ecosystem. Even a user_kill using SIGTERM doesn't get logged as a FAIL, I don't believe. At least, I would have expected to notice it, since Fastmail does an external SIGTERM to every logged-in process when a user's password is marked stolen. We use this code:
my @Procs = $Slot->RunCommand('cyr_info', 'proc');

my $Types = join "|", map { quotemeta } @Types;
my $TypesMatch = qr/$Types/;

my @Pids;
for (@Procs) {
  # 7686 httpjmap/jmap fastmail1.internal [10.37.129.192] testuser_25763_1_1685935...@fastmaildev.com /jmap/ws/ WS
  my ($Pid, $Proc, $Src, $Ip, $User, $Folder) = split / /;
  next unless $User;
  # Non httpjmap http processes change logged in user too quickly since they only
  # even handle a single short lived request and each request is authenticated
  # as a different user. This makes them very racy to try and kill so just ignore them
  next if $Proc =~ /^http/ && $Proc !~ /^httpjmap/;
  push @Pids, $Pid if $User eq $CyrusName && (!@Types || $Proc =~ $TypesMatch);
}

my $PidCount = scalar @Pids;
if ($PidCount && !-f "/tmp/nokillconnections") {
  kill 'TERM', @Pids;
}

print "OK $PidCount\n";
So I don't believe your case should happen frequently. Unless you have something that's actually crashing daemons, but in that case you have much bigger problems, since you'll be spending a lot of time in crash recovery code!
I'd hope that's not common.

Bron.

On Tue, Oct 1, 2024, at 06:08, ego...@sarenet.es wrote:

Hi Bron!
First of all, don't worry about the delay! I completely understand that you are extremely busy, and I am very thankful for the nice job Fastmail does with Cyrus. By the way, I apologize too for the delay in answering this email.
From what I have seen, this is implemented only in master, not in other tags or branches.
I honestly think that if a process receives a TERM signal that was not sent by master (which, as far as I saw, can be checked), perhaps it should get special handling, in the sense that it is not a failure: it is a requested process stop, and it should be carried out in an orderly way, as it is now.
Imagine that one morning (due to a malware outbreak) you have 4 or 5 user disconnections, with several processes being spawned constantly. There is a high probability of having more than MAX_READY_FAILS processes failing in less than MAX_READY_FAIL_INTERVAL. With the idea you described, mainly on busy machines, you would be responding very slowly to several requests during that MAX_READY_FAIL_INTERVAL (10 seconds), which could end up causing unnecessary load on the machine.
Don't you think another way of handling it could be to reset the integer nreadyfails in the data structure when:
* As of now: (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL)
* But also when (now - last_received_sigurg_timestamp < MAX_KEEP_ZERO_NREADYFAILS_INTERVAL), where MAX_KEEP_ZERO_NREADYFAILS_INTERVAL could be defined as 60 seconds, for instance? And, as now, only reset the integer if it is needed in that service?
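
The two reset conditions above can be sketched in C as a single predicate. This is only an illustration of the proposal, not code from master.c: the function name, the last_sigurg parameter, and the use of 0 to mean "SIGURG never received" are assumptions, and the 60-second value is the one suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define MAX_READY_FAIL_INTERVAL 10            /* seconds, as stated in this thread */
#define MAX_KEEP_ZERO_NREADYFAILS_INTERVAL 60 /* seconds, value suggested above */

/* Hypothetical helper: returns true if nreadyfails should be reset to zero
 * at time `now`. last_sigurg is the time master last received SIGURG, or 0
 * if it never did (the 0 sentinel is an assumption of this sketch). */
static bool should_reset_nreadyfails(time_t now, time_t lastreadyfail,
                                     time_t last_sigurg)
{
    assert(now >= lastreadyfail);

    /* Existing rule: the previous failure is older than the interval. */
    if (now - lastreadyfail > MAX_READY_FAIL_INTERVAL)
        return true;

    /* Proposed rule: an operator sent SIGURG recently, so failures caused
     * by deliberate kills are not counted for a while. */
    if (last_sigurg != 0 &&
        now - last_sigurg < MAX_KEEP_ZERO_NREADYFAILS_INTERVAL)
        return true;

    return false;
}
```

With this shape, reap_child() would only need to call the predicate where it currently compares against MAX_READY_FAIL_INTERVAL, and the signal handler would only need to record a timestamp.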
I say this because, if you don't do something similar to the ideas above, how should processes be shut down in an orderly way for users who must be disconnected with some urgency because unauthorized access with their username has been detected?
Cheers Bron!

On 2024-09-26 15:02, Bron Gondwana wrote:

Hi!  Sorry nobody got back to you earlier on this.
In an example of synchronicity, I recently wrote something touching the same area:
https://github.com/cyrusimap/cyrus-imapd/pull/5036
I solved this a different way - if the "babysit" option is set (it's automatic for DAEMON block and can be added to each SERVICE) then we only wait one set of MAX_READY_FAIL_INTERVAL before we'll start spawning processes again.
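
One way to read that behaviour is as a small predicate deciding whether master may spawn again. This C sketch is based only on the sentence above, not on the contents of the pull request, which may differ. The struct, the may_spawn() name, and the MAX_READY_FAILS value are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define MAX_READY_FAILS 10          /* assumed value, not stated in this thread */
#define MAX_READY_FAIL_INTERVAL 10  /* seconds, as stated in this thread */

/* Hypothetical, reduced stand-in for master's per-service record. */
struct service_model {
    int nreadyfails;       /* failures counted in the current window */
    time_t lastreadyfail;  /* time of the most recent READY failure */
    bool babysit;          /* "babysit" option set for this service */
};

/* Returns true if new processes may be spawned for the service at `now`. */
static bool may_spawn(const struct service_model *s, time_t now)
{
    assert(s != NULL);

    /* Below the failure limit: spawn as usual. */
    if (s->nreadyfails < MAX_READY_FAILS)
        return true;

    /* Over the limit, but babysat: wait out one interval, then resume. */
    if (s->babysit && now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL)
        return true;

    /* Over the limit and not babysat: stay disabled until SIGHUP. */
    return false;
}
```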
However... your fix also seems sound for the non-babysit case. Do you want to create it as a Pull Request against https://github.com/cyrusimap/cyrus-imapd [1] master? That's the easiest for us to work with, and ensures you get proper attribution!
Cheers,

Bron.

On Tue, Sep 10, 2024, at 12:55, ego...@sarenet.es wrote:

Hi!,
Some time ago, I wrote that when we send TERM to imapd processes that we want to exit (I assume it would happen with any other service too: pop, sieve...), because a user has asked to disconnect his/her sessions, it sometimes happened that, after disconnecting those sessions (TERM to the imapd processes), not enough new processes were spawned. This happened only sometimes, when very few processes needed to be killed.
I have been able to reproduce it. If a user has connected (visible because proctitle() has set it in the process name) and shortly afterwards "leaves" (logs out, for instance), the process moves to the READY state. If you then kill with TERM more than MAX_READY_FAILS of those processes in less than MAX_READY_FAIL_INTERVAL, master won't spawn new processes, as written in master.c around line 1100 in the reap_child() function.
It is suggested to send a SIGHUP to master to activate the service again, but it can't be re-enabled, because the service seems to have been removed from the s data structure but not stopped. Because the process was not stopped, when a new imapd attempts to load (in service_create()) it can't be created, since the socket is still in use.
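
The counting logic described above can be modelled in a few lines of C. This is a reduced sketch based on the reap_child() fragment visible in the patch further down, not the actual master.c code: the struct and function names are hypothetical, the real check also tests s->exec, and the MAX_READY_FAILS value is an assumption (the thread only states that the interval is 10 seconds).

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define MAX_READY_FAILS 10          /* assumed value, not stated in this thread */
#define MAX_READY_FAIL_INTERVAL 10  /* seconds, as stated in this thread */

/* Hypothetical, reduced stand-in for master's per-service record. */
struct service_model {
    int nreadyfails;       /* failures counted in the current window */
    time_t lastreadyfail;  /* time of the most recent READY failure */
    bool disabled;         /* set once the service stops being spawned */
};

/* Record one abnormal exit of a READY child at time `now`.
 * Returns true if the service is disabled after this failure. */
static bool note_ready_failure(struct service_model *s, time_t now)
{
    assert(s != NULL);

    /* Failures older than the interval no longer count. */
    if (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL)
        s->nreadyfails = 0;

    s->lastreadyfail = now;

    /* Too many failures close together: master logs
     * "disabling until next SIGHUP" and stops spawning. */
    if (++s->nreadyfails >= MAX_READY_FAILS)
        s->disabled = true;

    return s->disabled;
}
```

In this model, a burst of TERM signals to READY processes inside one interval disables the service, while the same number of failures spread more than MAX_READY_FAIL_INTERVAL apart never does.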
So, to confirm this is correct, I have written the following patch for master.c, which I tested on 3.0.15:
root@debugcyrus:/usr/ports/mail/cyrus-imapd30 # diff -u work/cyrus-imapd-3.0.15/master/master.c /master.c-definitivo
--- work/cyrus-imapd-3.0.15/master/master.c 2021-03-09 04:27:45.000000000 +0100
+++ /master.c-definitivo    2024-09-10 18:36:49.797581000 +0200
@@ -129,6 +129,11 @@
};

static int verbose = 0;
+
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+static int gotsigurg = 0;
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+
static int listen_queue_backlog = 32;
static int pidfd = -1;

@@ -1047,6 +1052,22 @@
}
}

+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+static void sigurg_handler(int sig __attribute__((unused)))
+{
+    syslog(LOG_DEBUG, "URG CAPTURADO!!!!!!");
+
+    if (gotsigurg)
+    {
+        gotsigurg = 0;
+    }
+    else
+    {
+        gotsigurg = 1;
+    }
+}
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+
static void reap_child(void)
{
int status;
@@ -1094,10 +1115,24 @@
"terminated abnormally",
SERVICEPARAM(s->name),
SERVICEPARAM(s->familyname), pid);
- if (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL) {
+
+ syslog(LOG_DEBUG, "Senal URG vale....--%d--",gotsigurg);
+
+ if ((now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL) || (gotsigurg))
+                        {
s->nreadyfails = 0;
+
+                            if (gotsigurg)
+                            {
+                                syslog(LOG_DEBUG, "RESETEANDO....");
+                                syslog(LOG_DEBUG, "RESETEADO....");
+                            }
}
+
+ syslog(LOG_DEBUG, "too many failures for service %s/%s, resetting counters due to SIGURG received in Cyrus master. El got vale--%d--",SERVICEPARAM(s->name),SERVICEPARAM(s->familyname),gotsigurg);
+
s->lastreadyfail = now;
+
if (++s->nreadyfails >= MAX_READY_FAILS && s->exec) {
syslog(LOG_ERR, "too many failures for "
"service %s/%s, disabling until next SIGHUP",
@@ -1305,11 +1340,18 @@
sigemptyset(&action.sa_mask);

action.sa_handler = sighup_handler;
+
#ifdef SA_RESTART
action.sa_flags |= SA_RESTART;
#endif
if (sigaction(SIGHUP, &action, NULL) < 0)
fatalf(1, "unable to install signal handler for SIGHUP: %m");
+
+    /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+    action.sa_handler = sigurg_handler;
+    if (sigaction(SIGURG, &action, NULL) < 0)
+        fatalf(1, "unable to install signal handler for SIGURG: %m");
+    /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */

action.sa_handler = sigalrm_handler;
if (sigaction(SIGALRM, &action, NULL) < 0)
root@debugcyrus:/usr/ports/mail/cyrus-imapd30 #
So, from what I have seen of how the Cyrus master wrapper is written (and the way it handles services), I think the possible solutions could be:
* Send a kill(pid, SIGTERM) to ensure the process dies before it is forgotten from the s structure.
* Do something similar to what I have done, which gives you a time window in which more failures than expected are tolerated for some seconds, and which could later be undone by, for instance, sending the same signal again.
So: reproduced, and something proposed at least.

What do you think about it? :)

Cheers!

Egoitz Aurrekoetxea
Systems Department
94 - 420 94 70
ego...@sarenet.es
www.sarenet.es [2]

Parque Tecnológico. Edificio 103
48170 Zamudio (Bizkaia)
Before printing this email, please consider whether it is necessary.
--
Bron Gondwana
br...@fastmail.fm
Cyrus [3] / Devel / see discussions [4] + participants [5] + delivery options [6] Permalink [7]





Links:
------
[1] https://github.com/cyrusimap/cyrus-imapd/pull/5036
[2] http://www.sarenet.es
[3] https://cyrus.topicbox.com/latest
[4] https://cyrus.topicbox.com/groups/devel
[5] https://cyrus.topicbox.com/groups/devel/members
[6] https://cyrus.topicbox.com/groups/devel/subscription

[7] https://cyrus.topicbox.com/groups/devel/Tf9f7cf579fff1397-M0cfe9d73756070115586cdc5










------------------------------------------
Cyrus: Devel
Permalink: 
https://cyrus.topicbox.com/groups/devel/Tf9f7cf579fff1397-M05286eea5b5ad35767b81f2f
Delivery options: https://cyrus.topicbox.com/groups/devel/subscription
