On Wed, Feb 25, 2026 at 08:49:00AM -0700, Keith Busch wrote:
> On Wed, Feb 25, 2026 at 03:32:21PM +0000, John Garry wrote:
> > +static int mpath_pr_register(struct block_device *bdev, u64 old_key,
> > +                   u64 new_key, unsigned int flags)
> > +{
> > +   struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
> > +   struct mpath_head *mpath_head = mpath_disk->mpath_head;
> > +   struct mpath_device *mpath_device;
> > +   int srcu_idx, ret = -EWOULDBLOCK;
> > +
> > +   srcu_idx = srcu_read_lock(&mpath_head->srcu);
> > +   mpath_device = mpath_find_path(mpath_head);
> > +   if (mpath_device)
> > +           ret = mpath_head->mpdt->pr_ops->pr_register(mpath_device,
> > +                           old_key, new_key, flags);
> > +   srcu_read_unlock(&mpath_head->srcu, srcu_idx);
> 
> Instead of having the lower layer define new mp template functions, why
> not use the existing pr_ops from mpath_device->disk->fops->pr_ops?

I don't think that's the right answer. The regular scsi persistent
reservation functions simply won't work on a multipath device. Even just
a simple reservation fails.

For example (with /dev/sda being multipath device 0):
# echo round-robin > /sys/class/scsi_mpath_device/0/iopolicy
# blkpr -c register -k 0x1 /dev/sda
# blkpr -c reserve -k 0x1 -t exclusive-access-reg-only /dev/sda
# dd if=/dev/sda of=/dev/null iflag=direct count=100
dd: error reading '/dev/sda': Invalid exchange
1+0 records in
1+0 records out
512 bytes copied, 0.00871312 s, 58.8 kB/s

Here are the kernel messages:
[ 3494.660401] sd 7:0:1:0: reservation conflict
[ 3494.661802] sd 7:0:1:0: [sda:1] tag#768 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 3494.664848] sd 7:0:1:0: [sda:1] tag#768 CDB: Read(10) 28 00 00 00 00 01 00 00 01 00
[ 3494.667092] reservation conflict error, dev sda:1, sector 1 op 0x0:(READ) flags 0x2800800 phys_seg 1 prio class 2

If you don't have a multipathed scsi device to try this on, you can run:

targetcli <<EOF
/backstores/ramdisk create mptest 1G
/loopback create naa.5001401111111111
/loopback create naa.5001402222222222
/loopback create naa.5001403333333333
/loopback create naa.5001404444444444
/loopback/naa.5001401111111111/luns create /backstores/ramdisk/mptest
/loopback/naa.5001402222222222/luns create /backstores/ramdisk/mptest
/loopback/naa.5001403333333333/luns create /backstores/ramdisk/mptest
/loopback/naa.5001404444444444/luns create /backstores/ramdisk/mptest
EOF

to create one.

Handling scsi Persistent Reservations on a multipath device is painful.
Here is a non-exhaustive list of the problems with trying to make a
multipath device act like a single scsi device for persistent
reservation purposes:

You need to register the key on all the I_T Nexuses. You can't just pick
a single path. Otherwise, when you set up the reservation, you will only
be able to do IO on one of the paths. That's what happened above.
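
To make the failure mode concrete, here is a minimal user-space sketch of
it. This is a toy model, not kernel code: struct toy_lun and the toy_*
helpers are made up for illustration, and the only rule modelled is that,
under a registrants-only reservation, IO from an unregistered I_T nexus
gets a reservation conflict.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PATHS 4

/* Toy model of one LUN reached through NR_PATHS I_T nexuses. */
struct toy_lun {
	uint64_t reg_key[NR_PATHS];	/* 0 means "no key registered" */
	bool reserved;			/* a registrants-only reservation is held */
};

/* Register a key on a single I_T nexus, like a REGISTER sent down one path. */
static void toy_register_one(struct toy_lun *lun, size_t path, uint64_t key)
{
	lun->reg_key[path] = key;
}

/* Register the key on every I_T nexus. */
static void toy_register_all(struct toy_lun *lun, uint64_t key)
{
	for (size_t i = 0; i < NR_PATHS; i++)
		toy_register_one(lun, i, key);
}

/* Take the reservation; only a registered nexus may do so. */
static bool toy_reserve(struct toy_lun *lun, size_t path)
{
	if (!lun->reg_key[path])
		return false;
	lun->reserved = true;
	return true;
}

/* IO on one path: 0 on success, -1 for a reservation conflict. */
static int toy_read(const struct toy_lun *lun, size_t path)
{
	if (lun->reserved && !lun->reg_key[path])
		return -1;
	return 0;
}

/* Round-robin reads over all paths; returns the number of conflicts. */
static int toy_round_robin_conflicts(const struct toy_lun *lun, int nr_ios)
{
	int conflicts = 0;

	for (int i = 0; i < nr_ios; i++)
		if (toy_read(lun, (size_t)i % NR_PATHS))
			conflicts++;
	return conflicts;
}
```

With toy_register_one() on path 0 followed by toy_reserve(), only the
reads that round-robin happens to send to path 0 succeed. With
toy_register_all() before the reserve, none of them conflict.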

If a path is down when you do the reservation, you might not be able to
register the key on that path. You certainly can't do it directly.
Using the All Target Ports bit (assuming the device supports it) could
let you extend a reservation from one target port to others, assuming
your path isn't down because of a connection issue on the host side. But
in general, you have to be able to handle the case where you can't
register (or unregister) a key on your failed paths. If you don't do
that (un)registration when the path comes up, before it can get selected
for handling IO, you will fail when accessing a path you should be
allowed to access, or succeed in accessing a path that you should not
be allowed to access.
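
A rough sketch of that ordering, again as a user-space toy rather than
kernel code. All names (toy_mpath, toy_path, toy_reinstate and so on) are
hypothetical. The multipath layer remembers the key every path is supposed
to carry, and a path that comes back has to be brought in line with it
before the path selector may use it. Preemption is deliberately ignored
here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PATHS 4

struct toy_path {
	bool up;		/* transport is usable */
	bool io_eligible;	/* path selector may pick this path */
	uint64_t dev_key;	/* key the target holds for this nexus, 0 = none */
};

struct toy_mpath {
	uint64_t want_key;	/* key every path should carry, 0 = none */
	struct toy_path paths[NR_PATHS];
};

/* Stand-in for sending REGISTER down one path; fails if the path is down. */
static bool toy_send_register(struct toy_path *p, uint64_t key)
{
	if (!p->up)
		return false;
	p->dev_key = key;
	return true;
}

/*
 * Change the key (0 unregisters). Paths that are up get updated now,
 * paths that are down keep whatever stale key the target has for them.
 */
static void toy_set_key(struct toy_mpath *m, uint64_t key)
{
	m->want_key = key;
	for (int i = 0; i < NR_PATHS; i++)
		toy_send_register(&m->paths[i], key);
}

/*
 * Called once the transport is back. The (un)registration is replayed
 * first, and the path only becomes eligible for IO if that worked.
 */
static bool toy_reinstate(struct toy_mpath *m, int idx)
{
	struct toy_path *p = &m->paths[idx];

	p->io_eligible = false;
	if (!p->up)
		return false;
	/* On real hardware the command itself can still fail here. */
	if (p->dev_key != m->want_key && !toy_send_register(p, m->want_key))
		return false;
	p->io_eligible = true;
	return true;
}
```

The point of the sketch is only the ordering: io_eligible is never set
while dev_key and want_key disagree, in either direction (a missing
registration or a stale one).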

The same is true when new paths are discovered. You need to register
them.

Except that a preempt can come and remove your registration at any time.
You can't register the new (or newly active) path if the key has been
preempted, and this preemption can happen at any moment, even after you
check if the other paths are still registered. If this isn't handled
correctly, paths can access storage that they should not be allowed to
access.
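
One possible way to close that window, sketched as a user-space toy
(hypothetical names throughout, and not a claim about what any existing
implementation does): register the new path first, and only then check
whether the key is still present on a path that was already registered.
A preempt that arrived before the registration shows up in the check. A
preempt that arrives afterwards removes the new registration along with
the others. The sketch assumes at least one other path held the key
beforehand.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PATHS 4

struct toy_lun {
	uint64_t reg_key[NR_PATHS];	/* per-nexus key, 0 = unregistered */
};

/* Another host preempts `key`: it is dropped from every nexus holding it. */
static void toy_remote_preempt(struct toy_lun *lun, uint64_t key)
{
	for (int i = 0; i < NR_PATHS; i++)
		if (lun->reg_key[i] == key)
			lun->reg_key[i] = 0;
}

/* True if any path other than `skip` still carries `key`. */
static bool toy_key_on_other_path(const struct toy_lun *lun, int skip,
				  uint64_t key)
{
	for (int i = 0; i < NR_PATHS; i++)
		if (i != skip && lun->reg_key[i] == key)
			return true;
	return false;
}

/*
 * Register first, verify second. If the key has vanished from the
 * other paths, it was preempted, so the new registration is undone.
 */
static bool toy_add_path(struct toy_lun *lun, int new_path, uint64_t key)
{
	lun->reg_key[new_path] = key;
	if (!toy_key_on_other_path(lun, new_path, key)) {
		lun->reg_key[new_path] = 0;
		return false;
	}
	return true;
}
```

Doing it the other way round (check, then register) leaves a gap between
the two steps in which a preempt goes unnoticed and the new path ends up
registered with a key that was supposed to be gone.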

Changing the reservation type (for instance from
exclusive-access-reg-only to write-exclusive-reg-only) in scsi devices
is done by preempting the existing reservation. This will remove the
registered keys from every path except the one issuing the command. The
key needs to be reregistered on all the other paths again. If any IO
goes to these paths before they are reregistered, it will fail with a
reservation conflict, so IO needs to be suspended during this time.
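
As a toy illustration of that sequence (user-space C, hypothetical names,
single-threaded, so the IO gate is only a flag): the preempt drops the key
from every path but the issuing one, and the type change is wrapped in a
suspend / reregister / resume so that no IO sees the half-registered
state.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PATHS 4

enum toy_pr_type {
	TOY_NONE,
	TOY_WRITE_EXCL_REG_ONLY,
	TOY_EXCL_ACCESS_REG_ONLY,
};

struct toy_lun {
	uint64_t reg_key[NR_PATHS];	/* per-nexus key, 0 = unregistered */
	enum toy_pr_type type;
	bool io_suspended;		/* multipath-level IO gate */
};

/*
 * Preempt our own key from one path: the reservation type changes and
 * the key is dropped from every other path.
 */
static bool toy_preempt_self(struct toy_lun *lun, int path,
			     enum toy_pr_type new_type)
{
	uint64_t key = lun->reg_key[path];

	if (!key)
		return false;
	for (int i = 0; i < NR_PATHS; i++)
		if (i != path && lun->reg_key[i] == key)
			lun->reg_key[i] = 0;
	lun->type = new_type;
	return true;
}

/* 0 = success, -1 = reservation conflict, 1 = held back by the IO gate. */
static int toy_read(const struct toy_lun *lun, int path)
{
	if (lun->io_suspended)
		return 1;
	if (lun->type == TOY_EXCL_ACCESS_REG_ONLY && !lun->reg_key[path])
		return -1;
	return 0;
}

/* Change the type with IO held back until every path is registered again. */
static bool toy_change_type(struct toy_lun *lun, int path,
			    enum toy_pr_type new_type)
{
	uint64_t key = lun->reg_key[path];
	bool ok;

	lun->io_suspended = true;
	ok = toy_preempt_self(lun, path, new_type);
	if (ok)
		for (int i = 0; i < NR_PATHS; i++)
			lun->reg_key[i] = key;
	lun->io_suspended = false;
	return ok;
}
```

Calling toy_preempt_self() on its own and then reading from another path
gives the reservation conflict. Going through toy_change_type() does not,
because reads are held back (return value 1) until the keys are restored.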

The path that is holding the reservation might be down. In this case,
you aren't able to release the reservation from that path. The only way
I figured out to handle this in dm-mpath was for the device to preempt
its own key, to move the reservation to a working path. This causes the
same issues as preempting a key to change the reservation type, where you
need to reregister all the paths with IO suspended.

An actual preemption can come in from another machine while you are
doing this. In that case, you must not reregister the paths, and if you
already started, you must unregister them.

I can probably come up with more issues.

I think the best course of action for now is to just fail persistent
reservations as non-supported for scsi devices. IMHO making them work
correctly (where multipath device IO won't fail when it should succeed,
and succeed when it should fail with a reservation conflict) dwarfs the
amount of work necessary to support ALUA.

dm-mpath previously did a pretty good job handling Persistent
Reservations. But recently it became much better, because it became very
clear that pretty good is not good enough for what people want to do
with Persistent Reservations and multipath devices.

-Ben

