On Oct 21, 2019, Jason Merrill <ja...@redhat.com> wrote:

> On Sat, Oct 19, 2019 at 12:08 AM Alexandre Oliva <ol...@gnu.org> wrote:

>> We might also wish to use different lock-breaking logic for that case,
>> too, e.g. checking that the timestamp of the dir didn't change by
>> comparing `ls -ld $lockdir` with what we got 30 seconds before.  If it
>> changed or the output is now empty, we just lost the race again.

>> It's unlikely that the dir would remain unchanged for several seconds
>> without the pid file, so if we get the same timestamp after 30 seconds,
>> it's likely that something went wrong with the lock holder,

> This patch uses stamps in the lock directory so that they are automatically
> cleared when the lock changes hands.

Nice!

>>                                                             though it's
>> not impossible to imagine a scenario in which the lock program that just
>> succeeded in creating the dir got stopped (^Z) or killed-9 just before
>> creating the PID file.

One kind of problem is a left-over lock after a kill -9, which gives
the terminated process no chance to run its lock cleanup code.  But the
current code will eventually steal it (maybe too soon) and proceed.


Another kind of problem could arise if a lock-needing program got
stopped at an unfortunate time (before creating the pid file) that
enabled the lock to be taken from it, and then woke back up believing it
still held the lock.  If it were to carry out anything that really
required mutual exclusion, Bad Things (TM) could happen.

Some systems adopt a STOPITH (that's short for Shoot The Other Process
In The Head) to avoid that scenario, but we don't have that choice:
either we're stealing from a PID that's gone (safe-ish, assuming all
users on a single host, without a shared FS across separate PID
namespaces, but still potentially racy, see below), or we don't know who
we're stealing from (very likely racy, except after a full-system
crash).
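
To make the "PID that's gone" case concrete, here is a minimal sketch of
such a check.  It is only an illustration: the holder_gone name and the
"pid" file inside the lock dir are assumptions, not something taken from
the patch.

```shell
# holder_gone LOCKDIR: succeed only if LOCKDIR records a PID that no
# longer exists.  The "pid" file name is a hypothetical layout.
holder_gone () {
  pid=`cat "$1/pid" 2>/dev/null` || return 1   # no PID file: holder unknown
  case $pid in
    ''|*[!0-9]*) return 1 ;;                   # empty or garbled: unknown
  esac
  # kill -0 sends no signal; it only checks whether PID could be signaled.
  # It also fails for a live process we lack permission to signal, which
  # is one reason this is only safe-ish.
  if kill -0 "$pid" 2>/dev/null; then
    return 1                                   # holder is still around
  fi
  return 0
}
```

Even when this succeeds, the check and the subsequent removal of the
lock dir are separate steps, so the result can be outdated by the time
it is acted upon.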


>> Even then, maybe breaking the lock is not such a great idea in
>> general...

> It seems to me that the downside of stealing the lock (greater resource
> contention) is less than the downside of erroring out (build fails, user
> has to intervene manually, possibly many hours later when they notice).

Oh, if that's indeed the only consequence, I tend to agree.  I was
worried about scenarios that actually required mutual exclusion.  For
that, stealing locks opens another can of worms, since multiple processes
might decide to steal the lock at the same time, and then each one might
end up stealing the lock from others who'd just stolen the lock, thinking
they're stealing from the dead or stopped process, with very
unpredictable results.  In short, it's not hard to implement lock
breaking, it's just very hard to make it race-free.
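
One such interleaving can be replayed sequentially.  The following is
just a sketch under made-up names (A and B stand for two waiting
processes, and the lock path is arbitrary); a real race would have the
two running concurrently.

```shell
# Sequential replay of one bad interleaving.  All names are illustrative.
lock=${TMPDIR:-/tmp}/steal-race.$$
a_holds=no b_holds=no

mkdir "$lock"               # left behind by a holder that died

# Both waiters inspect the lock and conclude that it is stale.
a_thinks_stale=yes
b_thinks_stale=yes

# A breaks the lock and takes it.
if [ "$a_thinks_stale" = yes ]; then
  rm -r "$lock" && mkdir "$lock" && a_holds=yes
fi

# B acts on its now outdated observation: the directory it removes
# is A's fresh lock, not the dead holder's.
if [ "$b_thinks_stale" = yes ]; then
  rm -r "$lock" && mkdir "$lock" && b_holds=yes
fi

echo "A holds: $a_holds, B holds: $b_holds"
rm -r "$lock"
```

This prints "A holds: yes, B holds: yes": both believe they own the
lock, which is harmless for serialization but not for mutual exclusion.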


We should probably have comments as to the intended use, to express a
preference for serialization, so that nobody ends up using it in cases
that actually depend on mutual exclusion (e.g. to avoid corrupting some
file by concurrently changing it) without disabling the racy lock
stealing machinery:

# Shell-based mutex using mkdir.  This script is used to prefer
# serialized execution to avoid consuming too much RAM.  If reusing it,
# bear in mind that the lock-breaking logic is not race-free, so disable
# it in err() if concurrent execution could cause more serious problems.

Ok with that change or somesuch.
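
To illustrate the distinction the comment draws, here is a rough sketch
of an mkdir-based lock in which stealing can be switched off.  It is not
the script from the patch: lock_acquire, lock_release and their
arguments are hypothetical names chosen for this example.

```shell
# lock_acquire LOCKDIR MAX_WAIT STEAL
# Try mkdir once per second.  After MAX_WAIT seconds, either break the
# lock (STEAL=yes, racy) or give up with nonzero status (STEAL=no).
lock_acquire () {
  waited=0
  until mkdir "$1" 2>/dev/null; do
    if [ "$waited" -ge "$2" ]; then
      [ "$3" = yes ] || return 1
      echo "removing stale $1" >&2
      rm -rf "$1"        # not race-free: another waiter may do the same
      waited=0
    else
      sleep 1
      waited=`expr $waited + 1`
    fi
  done
  return 0
}

# lock_release LOCKDIR
lock_release () {
  rm -rf "$1"
}
```

A caller that merely prefers serialization could pass STEAL=yes, whereas
one that depends on mutual exclusion would pass STEAL=no and treat the
failure as an error.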

(Shell functions used to be non-portable, but I think even
super-portable autoconf-generated configure scripts use them now.)

Thanks!

-- 
Alexandre Oliva, freedom fighter  he/him   https://FSFLA.org/blogs/lxo
Be the change, be Free!        FSF VP & FSF Latin America board member
GNU Toolchain Engineer                        Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás - Che GNUevara
