Package: multipath-tools
Version: 0.5.0-6+deb8u1
Severity: critical
Tags: patch


Configuration:
I have the following setup:
Dell PowerEdge M620 + QLogic ISP2532-based 8Gb Fibre Channel to PCI Express HBA,
attached to our SAN with multipath.
The OS is Debian Jessie 8.1.
The server's root file system resides on an LVM logical volume.
The packages multipath-tools and multipath-tools-boot are installed.

Symptom:
Approximately 50% of the time the server does not boot correctly, depending on
the outcome of the race condition between udev and multipathd (see below).
Instead, the password prompt for entering single user mode (or rescue.target)
appears.

Problem:
The problem seems to be the same one that Will Aoki already reported against
upgrade-reports in bug report 788295.
He was using open-iscsi, while I am using an FC HBA with the qla2xxx module. I
suspect other combinations are affected too.

Bug 788295 has a very detailed analysis of the problem. The provided logs 
correlate with mine.
Since 788295 was filed against upgrade-reports, it'll probably not get fixed, 
hence this report.

Further Information:
Existing Debian bug report: 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=788295
Ubuntu fixed the issue. See 
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1431650
Ubuntu Package with fix: 
http://packages.ubuntu.com/trusty-updates/multipath-tools
See also the comment of the patch taken from Ubuntu for more technical details.

Solution:
The following patch, taken from the Ubuntu package, solved the problem for both
me and Will Aoki.
Could you please add this patch to the official Debian package and, if
possible, get the fixed package into jessie-updates and the next jessie release?

------------------- START OF PATCH -----------------
From 841977fc9c3432702c296d6239e4a54291a6007a Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <h...@suse.de>
Date: Tue, 24 Jun 2014 08:49:15 +0200
Subject: [PATCH] libmultipath: use a shared lock to co-operate with udev

udev since v214 is placing a shared lock on the device node
whenever it's processing the event. This introduces a race
condition with multipathd, as multipathd is processing the
event for the block device at the same time as udev is
processing the events for the partitions.
And a lock on the partitions will also be visible on the
block device itself, hence multipathd won't be able to
lock the device.
When multipath manages to take a lock on the device,
udev will fail, and consequently ignore this entire event.
Which in turn might cause the system to malfunction as it
might have been a crucial event like 'remove' or 'link down'.

So we should better use LOCK_SH here; with that the flock
call in multipathd _and_ udev will succeed and the events
can be processed.

References: bnc#883878

Signed-off-by: Hannes Reinecke <h...@suse.de>
---
 libmultipath/configure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 0ddd3d5..dc2ebf0 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -529,7 +529,7 @@ lock_multipath (struct multipath * mpp, int lock)
                if (!pgp->paths)
                        continue;
                vector_foreach_slot(pgp->paths, pp, j) {
-                       if (lock && flock(pp->fd, LOCK_EX | LOCK_NB) &&
+                       if (lock && flock(pp->fd, LOCK_SH | LOCK_NB) &&
                            errno == EWOULDBLOCK)
                                goto fail;
                        else if (!lock)

------------------- END OF PATCH -----------------
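
For illustration, the flock() behaviour that the patch comment describes can be
reproduced outside multipathd. The following is a minimal sketch under stated
assumptions, not code from multipath-tools: the function name
second_lock_errno and the scratch file under /tmp are made up for this example,
and a regular file stands in for the block device node. The two descriptors
come from separate open calls, so they behave like two independent lock
holders (here standing in for udev and multipathd).

```c
/* Illustrative sketch only; not part of the patch or of multipath-tools. */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Open the same file twice, take holder_op on the first descriptor, then try
 * contender_op on the second one without blocking.
 * Returns 0 if the second lock is granted, the errno value if it is refused,
 * or -1 if the scratch file or the first lock could not be set up.
 */
int second_lock_errno(int holder_op, int contender_op)
{
    char path[] = "/tmp/flock-demo-XXXXXX";   /* hypothetical scratch file */
    int holder = mkstemp(path);
    if (holder < 0)
        return -1;

    /* A second open() yields an independent open file description. */
    int contender = open(path, O_RDONLY);
    unlink(path);                              /* file lives on until closed */
    if (contender < 0) {
        close(holder);
        return -1;
    }

    int result = 0;
    if (flock(holder, holder_op | LOCK_NB) != 0)
        result = -1;
    else if (flock(contender, contender_op | LOCK_NB) != 0)
        result = errno;

    close(contender);
    close(holder);
    return result;
}
```

With this sketch, second_lock_errno(LOCK_SH, LOCK_EX) should return
EWOULDBLOCK, which corresponds to the unpatched LOCK_EX call running into a
shared lock held by udev, whereas second_lock_errno(LOCK_SH, LOCK_SH) should
return 0, which corresponds to the patched behaviour.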

Additional comments:
Why I rated this critical: (1) The Ubuntu bug is rated critical. (2) I think
the "makes unrelated software on the system (or the whole system) break" clause
applies when a system no longer boots reliably.
I can provide journal entries of a failed boot attempt if necessary. Since such
logs already exist in bug 788295 and a tested patch exists, I did not think
they were needed.

Kind Regards
Niels Baumgartner
