An orderly close, or even a client disconnection which doesn't kill the process, shouldn't be a "FAIL". The only thing that's going to increment the FAIL counter is an actual service crash or explicit kill from outside the Cyrus ecosystem. Even a user_kill using SIGTERM doesn't get logged as a FAIL I don't believe. At least, I would have expected to notice it since Fastmail does an external SIGTERM to every logged-in process when a user's password is marked stolen. We use this code:
my @Procs = $Slot->RunCommand('cyr_info', 'proc'); my $Types = join "|", map { quotemeta } @Types; my $TypesMatch = qr/$Types/; my @Pids; for (@Procs) { # 7686 httpjmap/jmap fastmail1.internal [10.37.129.192] testuser_25763_1_1685935...@fastmaildev.com /jmap/ws/ WS my ($Pid, $Proc, $Src, $Ip, $User, $Folder) = split / /; next unless $User; # Non httpjmap http processes change logged in user too quickly since they only # even handle a single short lived request and each request is authenticated # as a different user. This makes them very racy to try and kill so just ignore them next if $Proc =~ /^http/ && $Proc !~ /^httpjmap/; push @Pids, $Pid if $User eq $CyrusName && (!@Types || $Proc =~ $TypesMatch); } my $PidCount = scalar @Pids; if ($PidCount && !-f "/tmp/nokillconnections") { kill 'TERM', @Pids; } print "OK $PidCount\n"; So I don't believe your case should happen frequently. Unless you have something that's actually crashing daemons, but in that case you have much bigger problems, since you'll be spending a lot of time in crash recovery code! I'd hope that's not common. Bron. On Tue, Oct 1, 2024, at 06:08, ego...@sarenet.es wrote: > Hi Bron! > > > > First of all, don't worry about the delay!. I absolutely understand you are > extremely busy and it's absolutely thankful the nice job of Fastmail with > Cyrus. By the way I apologize too for the delay answering you to this email. > > By what I have seen, this is implemented only in master. Not other tags or > branches. > > I honestly think that if a process receives a TERM signal and it's not > received (which I saw it could be checked) from the master, perhaps a special > handling should be made (in the sense it's not a fail, it's a required and > asked proccess stop required and that should be done orderly as it's being > doing). > > Imagine that you have in a morning (due to malware outbreak) 4 or 5 users > disconections with several processes being spawning constantly. There is a > high probability of having more than MAX_READY_FAIL processes failing in less > than MAX_READY_FAIL_INTERVAL. With the idea you said, mainly in busy machines > you would be responding very slowly during that MAX_READY_FAIL_INTERVAL (10 > seconds) to several requests which could finally end up in some not really > required load in the machine. > > Don't you think perhaps a way of handling it too, could be to reset the > integer nreadyfails in the data structure when : > > • As of now : (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL > • But too when (now - last_received_sigurg_timestamp < > MAX_KEEP_ZERO_NREADYFAILS_INTERVAL) where MAX_KEEP_ZERO_NREADYFAILS_INTERVAL > could be defined to 60 seconds for instance?. And as of now, only reset the > integer if it's needed in that service?. > I say it because if you don't do something similar to the commented ideas, > then how should be closed down in an ordered way processes for users which > should be disconnected with certain urgency due to non allowed accesses with > the username being detected?. > > > > Cheers Bron! > > > > El 2024-09-26 15:02, Bron Gondwana escribió: > >> Hi! Sorry nobody got back to you earlier on this. >> >> In an example of sychronicity, I recently wrote something touching the same >> area: >> >> https://github.com/cyrusimap/cyrus-imapd/pull/5036 >> >> I solved this a different way - if the "babysit" option is set (it's >> automatic for DAEMON block and can be added to each SERVICE) then we only >> wait one set of MAX_READY_FAIL_INTERVAL before we'll start spawning >> processes again. >> >> However... your fix also seems sound for the non-babysit case. Do you want >> to create it as a Pull Request against >> https://github.com/cyrusimap/cyrus-imapd >> <https://github.com/cyrusimap/cyrus-imapd/pull/5036> master? That's the >> easiest for us to work with, and ensures you get proper attribution! >> >> Cheers, >> >> Bron. >> >> On Tue, Sep 10, 2024, at 12:55, ego...@sarenet.es wrote: >>> Hi!, >>> >>> >>> >>> Sometime ago, I wrote that when we send TERM to imapd (although it would >>> happen with any other I assume, pop, sieve...) procceses we wanted to exit, >>> due to a user request for disconnecting his/her sessions, sometimes >>> happened that was like, after that sessions disconnection (TERM to imapd >>> processes) no enough processes where become spawned newly. Only sometimes >>> when very few processes needed to be killed. >>> >>> I have been able to reproduce it. If a user has connected (because >>> proctitle() has set it in the name) and later in very few time "leaves" >>> (logouts for instance) and then the process moves to READY state if you >>> kill with TERM more than MAX_READY_FAILS units of that process in less than >>> MAX_READY_FAIL_INTERVAL, master won't spawn new processes as it's written >>> in master.c in lines near 1100 in reap_child() function. >>> >>> It's suggested to launch a SIGHUP to master for activating again the >>> service, but it can't be enabled again because the service seems to have >>> removed from the s data structure but not stopped. Due to that non process >>> stop, when new imapd attemps to load (in service_create() ) it can't be >>> created because the socket is still in use. >>> >>> So, for ensuring this is correct, I have written the following patch for >>> master.c and that I tested on 3.0.15 : >>> >>> >>> >>> root@debugcyrus:/usr/ports/mail/cyrus-imapd30 # diff -u >>> work/cyrus-imapd-3.0.15/master/master.c /master.c-definitivo >>> --- work/cyrus-imapd-3.0.15/master/master.c 2021-03-09 >>> 04:27:45.000000000 +0100 >>> +++ /master.c-definitivo 2024-09-10 18:36:49.797581000 +0200 >>> @@ -129,6 +129,11 @@ >>> }; >>> >>> static int verbose = 0; >>> + >>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */ >>> +static int gotsigurg = 0; >>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */ >>> + >>> static int listen_queue_backlog = 32; >>> static int pidfd = -1; >>> >>> @@ -1047,6 +1052,22 @@ >>> } >>> } >>> >>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */ >>> +static void sigurg_handler(int sig __attribute__((unused))) >>> +{ >>> + syslog(LOG_DEBUG, "URG CAPTURADO!!!!!!"); >>> + >>> + if (gotsigurg) >>> + { >>> + gotsigurg = 0; >>> + } >>> + else >>> + { >>> + gotsigurg = 1; >>> + } >>> +} >>> +/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */ >>> + >>> static void reap_child(void) >>> { >>> int status; >>> @@ -1094,10 +1115,24 @@ >>> "terminated abnormally", >>> SERVICEPARAM(s->name), >>> SERVICEPARAM(s->familyname), pid); >>> - if (now - s->lastreadyfail > >>> MAX_READY_FAIL_INTERVAL) { >>> + >>> + syslog(LOG_DEBUG, "Senal URG vale.... >>> --%d--",gotsigurg); >>> + >>> + if ((now - s->lastreadyfail > >>> MAX_READY_FAIL_INTERVAL) || (gotsigurg)) >>> + { >>> s->nreadyfails = 0; >>> + >>> + if (gotsigurg) >>> + { >>> + syslog(LOG_DEBUG, "RESETEANDO...."); >>> + syslog(LOG_DEBUG, "RESETEADO...."); >>> + } >>> } >>> + >>> + syslog(LOG_DEBUG, "too many failures for service >>> %s/%s, resetting counters due to SIGURG received in Cyrus master. El got >>> vale --%d--",SERVICEPARAM(s->name),SERVICEPARAM(s->familyname),gotsigurg); >>> + >>> s->lastreadyfail = now; >>> + >>> if (++s->nreadyfails >= MAX_READY_FAILS && >>> s->exec) { >>> syslog(LOG_ERR, "too many failures for " >>> "service %s/%s, disabling until next >>> SIGHUP", >>> @@ -1305,11 +1340,18 @@ >>> sigemptyset(&action.sa_mask); >>> >>> action.sa_handler = sighup_handler; >>> + >>> #ifdef SA_RESTART >>> action.sa_flags |= SA_RESTART; >>> #endif >>> if (sigaction(SIGHUP, &action, NULL) < 0) >>> fatalf(1, "unable to install signal handler for SIGHUP: %m"); >>> + >>> + /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */ >>> + action.sa_handler = sigurg_handler; >>> + if (sigaction(SIGURG, &action, NULL) < 0) >>> + fatalf(1, "unable to install signal handler for SIGURG: %m"); >>> + /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */ >>> >>> action.sa_handler = sigalrm_handler; >>> if (sigaction(SIGALRM, &action, NULL) < 0) >>> root@debugcyrus:/usr/ports/mail/cyrus-imapd30 # >>> >>> >>> >>> >>> So by what I have seen Cyrus wrapper is written (and the way it handles >>> services), I think that the possible solutions could be : >>> >>> • Send a kill(pid,SIGTERM) for ensuring the process die before forgetting >>> from s structure. >>> • Do something similar as I have done, which gives you a time window for >>> having more failures than expected for some seconds and which later could >>> be undone with for instance the same signal sending. >>> >>> >>> Reproduced and something proposed at least. >>> >>> >>> >>> What do you think about it? :) >>> >>> >>> >>> Cheers! >>> >>> >>> >>> >>> >>> >>> >>> >>> sarenet >>> *Egoitz Aurrekoetxea* >>> Departamento de sistemas >>> 94 - 420 94 70 >>> ego...@sarenet.es >>> www.sarenet.es >>> Parque Tecnológico. Edificio 103 >>> 48170 Zamudio (Bizkaia) >>> servicios sarenet >>> Antes de imprimir este correo electrónico piense si es necesario hacerlo. >>> >>> >> >> -- >> Bron Gondwana >> br...@fastmail.fm >> >> >> *Cyrus <https://cyrus.topicbox.com/latest>* / Devel / see discussions >> <https://cyrus.topicbox.com/groups/devel> + participants >> <https://cyrus.topicbox.com/groups/devel/members> + delivery options >> <https://cyrus.topicbox.com/groups/devel/subscription> Permalink >> <https://cyrus.topicbox.com/groups/devel/Tf9f7cf579fff1397-M0cfe9d73756070115586cdc5> > > > sarenet > *Egoitz Aurrekoetxea* > Departamento de sistemas > 94 - 420 94 70 > ego...@sarenet.es > www.sarenet.es > Parque Tecnológico. Edificio 103 > 48170 Zamudio (Bizkaia) > servicios sarenet > Antes de imprimir este correo electrónico piense si es necesario hacerlo. -- Bron Gondwana br...@fastmail.fm ------------------------------------------ Cyrus: Devel Permalink: https://cyrus.topicbox.com/groups/devel/Tf9f7cf579fff1397-M1c1ea091515d2c811550d096 Delivery options: https://cyrus.topicbox.com/groups/devel/subscription