That's good! There's even a simpler way that what we're using now
(though we haven't switched yet), which is to use cyr_deny. It can
kill all existing connections for a user as it adds them to the deny
database. In our case that's not quite what we want though - we just
want to drop existing connections since new connections won't be able
to authenticate with the expired credential. We don't want to block
the user.
Bron.
On Wed, Oct 2, 2024, at 04:41, ego...@sarenet.es wrote:
Hi!
Ok, I think it's solved due to the code you show me yesterday of how
are closing connections at your Perl script.
Concretely I share with you a video of our testing env where you could
check could do I reproduce the issue.
https://sarenetes.storage.saremail.com/index.php/s/sepKebiz8nog8xM
===> Password : XMXbqq44yn
Using totally vanila sources of a 3.0.15 version, you login and logout.
Then the process title does not get update at least in FreeBSD but I
assume it happens in all platforms as that is usually done with a
syscall for that purpose. Anyway, even when the user is not logged we
still kill it as we do the commented ps. Then this effect happens.
We were doing it wrong. We should not see what to kill by issuing a ps
and seeing the proctitle. We should use the same way as you the
cyr_info command which I assume it uses a updated list of pids in an
internal data structure and killing that pids indicated by cyr_info,
this shouldn't happen because even, if you launch as you stated a TERM
to a process where a user is logged, that process does not log :
Oct 2 09:40:39 debugcyrus master[729]: service imap/ipv4 pid 743 in
READY state: terminated abnormally
it logs instead (which does different things) :
Oct 2 10:23:54 debugcyrus imap[1434]: graceful shutdown initiated by
unexpected process 50285
and
Oct 2 10:23:54 debugcyrus master[1427]: process type:SERVICE name:imap
path:/usr/local/cyrus/libexec/imapd age:50.064s pid:1434 exited, status
75
So the problem seemed it was that we didn't know the proccess title was
not get updated until the next command received in a request in the pid
and we killed it. Apart from that we have some more outdated info,
because we make some sleep over there before the kill for that PIDs we
get on the ps output.
I'll replace the ps with a cyr_proc proc and I assume we won't suffer
more issues.
This was all that was happenning. Thank you so much for your time and
sorry for the noise, but sometime we commented at the user list we did
a TERM to the proccesses and as you told us it was fine, we though all
should be fine. Yesterday, when you saw me the Perl where you launch it
seems in shell the cyr_info I noticed where the problem was.
So it's important for anybody I assume to keep in mind that at least,
in FreeBSD, the process title doesn't get updated when the protocol
command ends. It stays that process title until the process receives
next command of the protocol it is answering for (or the protocol that
kind of binary provides).
Sorry for the noise...
Cheers!!
El 2024-10-01 17:59, egoitz via Devel escribió:
I think perhaps the difference is that you launch a cyr_info in the
slot and then you launch a TERM to that pids....
We do that without cyr_info and with the pid proctitles...
Perhaps you have a faster refresh or a more recent set of PIDs on your
call with cyr_info because perhaps is far more faster than the ps we do
and then we end up by launching TERM to processes in READY state
without a user connected?. I think it didn't get reproduced if we
launched it to a user logged in. Just to a process where the user has
been logged in and now is logged out but whose pid we have fetched
before it has logged out and before getting the process in READY state.
We normally don't have issues when :
- The user has very few processes to kill (less than the
MAX_READY_FAILS).
- The user has lots of processes but it's logged in.
We didn't really know how to reproduce it and finally I ended up by
noting I needed to be logged out from imap for reproducing it. Perhaps
do a login, whatever command, logout and then TERM all that processes.
I could try repeating it tomorrow but I'm pretty sure that were the
requirements.
Perhaps the proper way of doing it is with cyr_info instead of with ps.
I'm thinking now, perhaps, the logout command does not refresh the
proctitle (until the next login or similar) and the last command (the
logout) stais as proccess title and that's why we call a kill TERM for
it?. Perhaps cyr_info doesn't show that processes?.
I'll try to see if there's this difference tomorrow.
Fortunately we don't normally have the need of doing crash recoveries
and when need, we set another slot as master of the mailboxes...
I'll update you tomorrow,
Cheers
El 2024-10-01 17:39, egoitz via Devel escribió:
Hi Bron,
The problem happened when a process served a connection. Later it got
idle. In Ready state. If you launch a TERM to that process in READY
state at that moment (and the even the user is logged out) I have seen
that s->nreadyfails of the service get's incremented. I could should
you with a video recorded or similar. Then if you do it exactly that in
more times than MAX_READY_FAILS in MAX_READY_FAIL_INTERVAL seconds you
got this issue.
As commented you have to get disconnected from the account (issue a
imap logout) and leave the process as READY for reproducing it.
I don't really know how $Slot->RunCommand('cyr_info', 'proc') fills the
@Procs array (with which pids) but I get the pids by using a shell
script and later calling kill to them. Concretely we end up by
launching :
ps auxwwww | perl -ne "print if /^$USER\s+/" | egrep 'imapd|pop3d:
imap|pop3: proxy' | egrep -i "PROXY1] $1@$2|PROXY2] $1@$2" | awk
'{print $2}' | tr -d ' ' | xargs -I [] sh -c "echo Parando proceso
--[]-- && kill -TERM []"
That's not part of Cyrus ecosystem and perhaps we are not doing it
properly. That's why I wanted to know. Perhaps, cyr_info command does
not return processes where a user has logged out?. And that's why you
don't get this effect?.
Perhaps it could be?.
Cheers!
El 2024-10-01 15:36, Bron Gondwana escribió:
An orderly close, or even a client disconnection which doesn't kill the
process, shouldn't be a "FAIL". The only thing that's going to
increment the FAIL counter is an actual service crash or explicit kill
from outside the Cyrus ecosystem. Even a user_kill using SIGTERM
doesn't get logged as a FAIL I don't believe. At least, I would have
expected to notice it since Fastmail does an external SIGTERM to every
logged-in process when a user's password is marked stolen. We use this
code:
my @Procs = $Slot->RunCommand('cyr_info', 'proc');
my $Types = join "|", map { quotemeta } @Types;
my $TypesMatch = qr/$Types/;
my @Pids;
for (@Procs) {
# 7686 httpjmap/jmap fastmail1.internal [10.37.129.192]
testuser_25763_1_1685935...@fastmaildev.com /jmap/ws/ WS
my ($Pid, $Proc, $Src, $Ip, $User, $Folder) = split / /;
next unless $User;
# Non httpjmap http processes change logged in user too quickly since
they only
# even handle a single short lived request and each request is
authenticated
# as a different user. This makes them very racy to try and kill so
just ignore them
next if $Proc =~ /^http/ && $Proc !~ /^httpjmap/;
push @Pids, $Pid if $User eq $CyrusName && (!@Types || $Proc =~
$TypesMatch);
}
my $PidCount = scalar @Pids;
if ($PidCount && !-f "/tmp/nokillconnections") {
kill 'TERM', @Pids;
}
print "OK $PidCount\n";
So I don't believe your case should happen frequently. Unless you have
something that's actually crashing daemons, but in that case you have
much bigger problems, since you'll be spending a lot of time in crash
recovery code!
I'd hope that's not common.
Bron.
On Tue, Oct 1, 2024, at 06:08, ego...@sarenet.es wrote:
Hi Bron!
First of all, don't worry about the delay!. I absolutely understand you
are extremely busy and it's absolutely thankful the nice job of
Fastmail with Cyrus. By the way I apologize too for the delay answering
you to this email.
By what I have seen, this is implemented only in master. Not other tags
or branches.
I honestly think that if a process receives a TERM signal and it's not
received (which I saw it could be checked) from the master, perhaps a
special handling should be made (in the sense it's not a fail, it's a
required and asked proccess stop required and that should be done
orderly as it's being doing).
Imagine that you have in a morning (due to malware outbreak) 4 or 5
users disconections with several processes being spawning constantly.
There is a high probability of having more than MAX_READY_FAIL
processes failing in less than MAX_READY_FAIL_INTERVAL. With the idea
you said, mainly in busy machines you would be responding very slowly
during that MAX_READY_FAIL_INTERVAL (10 seconds) to several requests
which could finally end up in some not really required load in the
machine.
Don't you think perhaps a way of handling it too, could be to reset the
integer nreadyfails in the data structure when :
* As of now : (now - s->lastreadyfail > MAX_READY_FAIL_INTERVAL
* But too when (now - last_received_sigurg_timestamp <
MAX_KEEP_ZERO_NREADYFAILS_INTERVAL) where
MAX_KEEP_ZERO_NREADYFAILS_INTERVAL could be defined to 60 seconds for
instance?. And as of now, only reset the integer if it's needed in
that service?.
I say it because if you don't do something similar to the commented
ideas, then how should be closed down in an ordered way processes for
users which should be disconnected with certain urgency due to non
allowed accesses with the username being detected?.
Cheers Bron!
El 2024-09-26 15:02, Bron Gondwana escribió:
Hi! Sorry nobody got back to you earlier on this.
In an example of sychronicity, I recently wrote something touching the
same area:
https://github.com/cyrusimap/cyrus-imapd/pull/5036
I solved this a different way - if the "babysit" option is set (it's
automatic for DAEMON block and can be added to each SERVICE) then we
only wait one set of MAX_READY_FAIL_INTERVAL before we'll start
spawning processes again.
However... your fix also seems sound for the non-babysit case. Do you
want to create it as a Pull Request against
https://github.com/cyrusimap/cyrus-imapd [1] master? That's the
easiest for us to work with, and ensures you get proper attribution!
Cheers,
Bron.
On Tue, Sep 10, 2024, at 12:55, ego...@sarenet.es wrote:
Hi!,
Sometime ago, I wrote that when we send TERM to imapd (although it
would happen with any other I assume, pop, sieve...) procceses we
wanted to exit, due to a user request for disconnecting his/her
sessions, sometimes happened that was like, after that sessions
disconnection (TERM to imapd processes) no enough processes where
become spawned newly. Only sometimes when very few processes needed to
be killed.
I have been able to reproduce it. If a user has connected (because
proctitle() has set it in the name) and later in very few time "leaves"
(logouts for instance) and then the process moves to READY state if you
kill with TERM more than MAX_READY_FAILS units of that process in less
than MAX_READY_FAIL_INTERVAL, master won't spawn new processes as it's
written in master.c in lines near 1100 in reap_child() function.
It's suggested to launch a SIGHUP to master for activating again the
service, but it can't be enabled again because the service seems to
have removed from the s data structure but not stopped. Due to that non
process stop, when new imapd attemps to load (in service_create() ) it
can't be created because the socket is still in use.
So, for ensuring this is correct, I have written the following patch
for master.c and that I tested on 3.0.15 :
root@debugcyrus:/usr/ports/mail/cyrus-imapd30 # diff -u
work/cyrus-imapd-3.0.15/master/master.c /master.c-definitivo
--- work/cyrus-imapd-3.0.15/master/master.c 2021-03-09
04:27:45.000000000 +0100
+++ /master.c-definitivo 2024-09-10 18:36:49.797581000 +0200
@@ -129,6 +129,11 @@
};
static int verbose = 0;
+
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+static int gotsigurg = 0;
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+
static int listen_queue_backlog = 32;
static int pidfd = -1;
@@ -1047,6 +1052,22 @@
}
}
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+static void sigurg_handler(int sig __attribute__((unused)))
+{
+ syslog(LOG_DEBUG, "URG CAPTURADO!!!!!!");
+
+ if (gotsigurg)
+ {
+ gotsigurg = 0;
+ }
+ else
+ {
+ gotsigurg = 1;
+ }
+}
+/* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+
static void reap_child(void)
{
int status;
@@ -1094,10 +1115,24 @@
"terminated abnormally",
SERVICEPARAM(s->name),
SERVICEPARAM(s->familyname), pid);
- if (now - s->lastreadyfail >
MAX_READY_FAIL_INTERVAL) {
+
+ syslog(LOG_DEBUG, "Senal URG vale....
--%d--",gotsigurg);
+
+ if ((now - s->lastreadyfail >
MAX_READY_FAIL_INTERVAL) || (gotsigurg))
+ {
s->nreadyfails = 0;
+
+ if (gotsigurg)
+ {
+ syslog(LOG_DEBUG, "RESETEANDO....");
+ syslog(LOG_DEBUG, "RESETEADO....");
+ }
}
+
+ syslog(LOG_DEBUG, "too many failures for
service %s/%s, resetting counters due to SIGURG received in Cyrus
master. El got vale
--%d--",SERVICEPARAM(s->name),SERVICEPARAM(s->familyname),gotsigurg);
+
s->lastreadyfail = now;
+
if (++s->nreadyfails >= MAX_READY_FAILS && s->exec) {
syslog(LOG_ERR, "too many failures for "
"service %s/%s, disabling until next SIGHUP",
@@ -1305,11 +1340,18 @@
sigemptyset(&action.sa_mask);
action.sa_handler = sighup_handler;
+
#ifdef SA_RESTART
action.sa_flags |= SA_RESTART;
#endif
if (sigaction(SIGHUP, &action, NULL) < 0)
fatalf(1, "unable to install signal handler for SIGHUP: %m");
+
+ /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
+ action.sa_handler = sigurg_handler;
+ if (sigaction(SIGURG, &action, NULL) < 0)
+ fatalf(1, "unable to install signal handler for SIGURG: %m");
+ /* RESET MAX_READY_FAILS OF SERVICE IN MASTER WRAPPER */
action.sa_handler = sigalrm_handler;
if (sigaction(SIGALRM, &action, NULL) < 0)
root@debugcyrus:/usr/ports/mail/cyrus-imapd30 #
So by what I have seen Cyrus wrapper is written (and the way it handles
services), I think that the possible solutions could be :
* Send a kill(pid,SIGTERM) for ensuring the process die before
forgetting from s structure.
* Do something similar as I have done, which gives you a time window
for having more failures than expected for some seconds and which later
could be undone with for instance the same signal sending.
Reproduced and something proposed at least.
What do you think about it? :)
Cheers!
Egoitz Aurrekoetxea
Departamento de sistemas
94 - 420 94 70
ego...@sarenet.es
www.sarenet.es [2]
Parque Tecnológico. Edificio 103
48170 Zamudio (Bizkaia)
Antes de imprimir este correo electrónico piense si es necesario
hacerlo.
--
Bron Gondwana
br...@fastmail.fm