Hi,
I have a strange issue where LIO-T-based iSCSI targets and LUNs most of the
time simply don’t work. They either fail to start, or bounce between nodes
until no more nodes are left to try.
The less-than-useful information in the logs looks like this:
Aug 21 22:49:06 [10531] storage-1-prod pengine: warning:
check_migration_threshold: Forcing iscsi0-target away from storage-1-prod after
1000000 failures (max=1000000)
Aug 21 22:54:47 storage-1-prod crmd[2757]: notice: Result of start operation
for ip-iscsi0-vlan40 on storage-1-prod: 0 (ok)
Aug 21 22:54:47 storage-1-prod iSCSITarget(iscsi0-target)[5427]: WARNING:
Configuration parameter "tid" is not supported by the iSCSI implementation and
will be ignored.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO:
Parameter auto_add_default_portal is now 'false'.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Created
target iqn.2017-08.acccess.net:prod-1-ha. Created TPG 1.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: ERROR: This
Target already exists in configFS
Aug 21 22:54:48 storage-1-prod crmd[2757]: notice: Result of start operation
for iscsi0-target on storage-1-prod: 1 (unknown error)
Aug 21 22:54:49 storage-1-prod iSCSITarget(iscsi0-target)[5536]: INFO: Deleted
Target iqn.2017-08.access.net:prod-1-ha.
Aug 21 22:54:49 storage-1-prod crmd[2757]: notice: Result of stop operation
for iscsi0-target on storage-1-prod: 0 (ok)
Now, the "unknown error" actually seems to be a targetcli-level error: "This
Target already exists in configFS". Checking with targetcli, however, shows
zero configured items on either node.
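Since targetcli can report an empty config while stale entries survive in configFS, it may be worth checking the configFS tree directly. A minimal sketch, assuming the usual configfs mount point and the IQN from this cluster config:

```shell
#!/bin/sh
# Sketch: look for a leftover target entry directly in configFS. A stale
# directory here, left by an earlier failed start/stop cycle, would explain
# "This Target already exists in configFS" even when `targetcli ls` is empty.
IQN="iqn.2017-08.access.net:prod-1-ha"       # IQN from the cluster config
CFS="/sys/kernel/config/target/iscsi/$IQN"   # usual configfs location

if [ -d "$CFS" ]; then
    STATE="stale"
else
    STATE="clean"
fi
echo "configFS state for $IQN: $STATE"
```

If this reports "stale" while targetcli shows nothing, the resource agent and targetcli are disagreeing about the kernel state, which would match the symptoms above.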
Manually starting the target gives:
john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-target
Error performing operation: Operation not permitted
Operation start for iscsi0-target (ocf:heartbeat:iSCSITarget) returned 1
> stderr: WARNING: Configuration parameter "tid" is not supported by the
> iSCSI implementation and will be ignored.
> stderr: INFO: Parameter auto_add_default_portal is now 'false'.
> stderr: INFO: Created target iqn.2017-08.access.net:prod-1-ha. Created TPG
> 1.
> stderr: ERROR: This Target already exists in configFS
But now targetcli shows at least the target, while crm status still reports it
as stopped.
Manually starting the LUNs gives:
john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun0
Operation start for iscsi0-lun0 (ocf:heartbeat:iSCSILogicalUnit) returned 0
> stderr: INFO: Created block storage object iscsi0-lun0 using
> /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root.
> stderr: INFO: Created LUN 0.
> stderr: DEBUG: iscsi0-lun0 start : 0
john@storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun1
Operation start for iscsi0-lun1 (ocf:heartbeat:iSCSILogicalUnit) returned 0
> stderr: INFO: Created block storage object iscsi0-lun1 using
> /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap.
> stderr: /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit: line 378:
> /sys/kernel/config/target/core/iblock_0/iscsi0-lun1/wwn/vpd_unit_serial: No
> such file or directory
> stderr: INFO: Created LUN 1.
> stderr: DEBUG: iscsi0-lun1 start : 0
So for the second LUN the iSCSILogicalUnit script seems to write to a configFS
path that doesn’t exist. Checking with targetcli, however, shows both LUNs and
the target up and running.
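The vpd_unit_serial error looks like the resource agent guessed the wrong HBA index (iblock_0) for the second backstore; with two block backstores, the second one would typically live under iblock_1. A hedged sketch to locate the path that actually exists (the iblock_* layout is an assumption about this LIO version):

```shell
#!/bin/sh
# Sketch: glob across all iblock HBAs to find where the LUN's wwn attribute
# really lives, instead of assuming iblock_0 as the agent apparently does.
LUN_NAME="iscsi0-lun1"
FOUND=""
for wwn in /sys/kernel/config/target/core/iblock_*/"$LUN_NAME"/wwn/vpd_unit_serial; do
    [ -e "$wwn" ] && FOUND="$wwn"
done
if [ -n "$FOUND" ]; then
    echo "vpd_unit_serial lives at: $FOUND"
else
    echo "no configFS entry found for $LUN_NAME"
fi
```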
Checking again with crm status (and pcs status) shows all three resources still
stopped. Since LUNs are colocated with the target and the target still has fail
counts, I clear them with:
sudo pcs resource cleanup iscsi0-target
Now the LUNs and target are all active in crm status / pcs status. But it’s
quite a manual process to get this working! I’m thinking either my
configuration is bad, or there is a bug somewhere in targetcli / LIO or the
iSCSI heartbeat resource agent.
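For reference, the manual workaround described above boils down to the following sequence (resource names are the ones from this cluster; the pcs guard is only so the sketch degrades to a dry run on a box without pcs):

```shell
#!/bin/sh
# Sketch of the manual recovery sequence described above. debug-start
# bypasses the cluster scheduler, so this is a workaround, not a fix.
TARGET="iscsi0-target"
LUNS="iscsi0-lun0 iscsi0-lun1"

run() {  # dry-run wrapper: only execute if pcs is actually installed
    if command -v pcs >/dev/null 2>&1; then "$@"; else echo "(would run) $*"; fi
}

run pcs resource debug-start "$TARGET"   # first attempt may hit the configFS error
for lun in $LUNS; do
    run pcs resource debug-start "$lun"
done
run pcs resource cleanup "$TARGET"       # clear fail counts so the CIB catches up
```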
On top of all the manual work, it still breaks on any action: a move, failover,
reboot, etc. instantly breaks it again. Everything else (the underlying ZFS
pool, the DRBD device, the IPv4 addresses, etc.) moves just fine; it’s only the
iSCSI that’s being problematic.
Concrete questions:
- Is my config bad?
- Is there a known issue with iSCSI? (I have only found old references about
ordering.)
I have attached the output of crm configure show as cib.txt, and the status
after a fresh boot of both nodes is:
Current DC: storage-2-prod (version 1.1.16-94ff4df) - partition with quorum
Last updated: Mon Aug 21 22:55:05 2017
Last change: Mon Aug 21 22:36:23 2017 by root via cibadmin on storage-1-prod
2 nodes configured
21 resources configured
Online: [ storage-1-prod storage-2-prod ]
Full list of resources:
ip-iscsi0-vlan10 (ocf::heartbeat:IPaddr2): Started storage-1-prod
ip-iscsi0-vlan20 (ocf::heartbeat:IPaddr2): Started storage-1-prod
ip-iscsi0-vlan30 (ocf::heartbeat:IPaddr2): Started storage-1-prod
ip-iscsi0-vlan40 (ocf::heartbeat:IPaddr2): Started storage-1-prod
Master/Slave Set: drbd_master_slave0 [drbd_disk0]
Masters: [ storage-1-prod ]
Slaves: [ storage-2-prod ]
Master/Slave Set: drbd_master_slave1 [drbd_disk1]
Masters: [ storage-2-prod ]
Slaves: [ storage-1-prod ]
ip-iscsi1-vlan10 (ocf::heartbeat:IPaddr2): Started storage-2-prod
ip-iscsi1-vlan20 (ocf::heartbeat:IPaddr2): Started storage-2-prod
ip-iscsi1-vlan30 (ocf::heartbeat:IPaddr2): Started storage-2-prod
ip-iscsi1-vlan40 (ocf::heartbeat:IPaddr2): Started storage-2-prod
st-storage-1-prod (stonith:meatware): Started storage-2-prod
st-storage-2-prod (stonith:meatware): Started storage-1-prod
zfs-iscsipool0 (ocf::heartbeat:ZFS): Started storage-1-prod
zfs-iscsipool1 (ocf::heartbeat:ZFS): Started storage-2-prod
iscsi0-lun0 (ocf::heartbeat:iSCSILogicalUnit): Stopped
iscsi0-lun1 (ocf::heartbeat:iSCSILogicalUnit): Stopped
iscsi0-target (ocf::heartbeat:iSCSITarget): Stopped
Clone Set: dlm-clone [dlm]
Started: [ storage-1-prod storage-2-prod ]
Failed Actions:
* iscsi0-target_start_0 on storage-2-prod 'unknown error' (1): call=99,
status=complete, exitreason='none',
last-rc-change='Mon Aug 21 22:54:49 2017', queued=0ms, exec=954ms
* iscsi0-target_start_0 on storage-1-prod 'unknown error' (1): call=98,
status=complete, exitreason='none',
last-rc-change='Mon Aug 21 22:54:47 2017', queued=0ms, exec=1062ms
Regards,
John
node 180945669: storage-1-prod
node 180945670: storage-2-prod \
attributes
primitive dlm ocf:pacemaker:controld \
op start interval=0s timeout=90 \
op stop interval=0s timeout=100 \
op monitor interval=60s
primitive drbd_disk0 ocf:linbit:drbd \
params drbd_resource=disk0 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd_disk1 ocf:linbit:drbd \
params drbd_resource=disk1 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive ip-iscsi0-vlan10 IPaddr2 \
params ip=10.201.0.25 nic=eno4 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan20 IPaddr2 \
params ip=10.201.1.25 nic=eno3 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan30 IPaddr2 \
params ip=10.201.2.25 nic=eno2 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi0-vlan40 IPaddr2 \
params ip=10.201.3.25 nic=eno1 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan10 IPaddr2 \
params ip=10.201.0.26 nic=eno4 cidr_netmask=24 \
meta migration-threshold=2 \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan20 IPaddr2 \
params ip=10.201.1.26 nic=eno3 cidr_netmask=24 \
meta migration-threshold=2 \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan30 IPaddr2 \
params ip=10.201.2.26 nic=eno2 cidr_netmask=24 \
meta migration-threshold=2 \
op monitor interval=20 on-fail=restart timeout=60
primitive ip-iscsi1-vlan40 IPaddr2 \
params ip=10.201.3.26 nic=eno1 cidr_netmask=24 \
meta migration-threshold=2 \
op monitor interval=20 on-fail=restart timeout=60
primitive iscsi0-lun0 iSCSILogicalUnit \
params implementation=lio-t target_iqn="iqn.2017-08.access.net:prod-1-ha" lun=0 path="/dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root" \
meta target-role=Started
primitive iscsi0-lun1 iSCSILogicalUnit \
params implementation=lio-t target_iqn="iqn.2017-08.access.net:prod-1-ha" lun=1 path="/dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap" \
meta target-role=Started
primitive iscsi0-target iSCSITarget \
params implementation=lio-t iqn="iqn.2017-08.access.net:prod-1-ha" tid=1 \
op monitor interval=30s \
meta target-role=Started
primitive st-storage-1-prod stonith:meatware \
params hostlist=storage-1-prod \
meta target-role=Started
primitive st-storage-2-prod stonith:meatware \
params hostlist=storage-2-prod \
meta target-role=Started
primitive zfs-iscsipool0 ZFS \
params pool=iscsipool0 \
op start timeout=90 interval=0 \
op stop timeout=90 interval=0 \
meta target-role=Started
primitive zfs-iscsipool1 ZFS \
params pool=iscsipool1 \
op start timeout=90 interval=0 \
op stop timeout=90 interval=0 \
meta target-role=Started
ms drbd_master_slave0 drbd_disk0 \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true target-role=Started
ms drbd_master_slave1 drbd_disk1 \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true target-role=Started
clone dlm-clone dlm \
meta clone-max=2 clone-node-max=1 target-role=Started
location cli-prefer-drbd_master_slave0 drbd_master_slave0 role=Master inf:
storage-1-prod
location cli-prefer-drbd_master_slave1 drbd_master_slave1 role=Started inf:
storage-2-prod
location cli-prefer-zfs-iscsipool0 zfs-iscsipool0 role=Started inf:
storage-1-prod
location cli-prefer-zfs-iscsipool1 zfs-iscsipool1 role=Started inf:
storage-2-prod
order ip0-after-drbd0 inf: drbd_master_slave0:promote zfs-iscsipool0
ip-iscsi1-vlan10 ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40
iscsi0-target iscsi0-lun0 iscsi0-lun1
order ip1-after-drbd1 inf: drbd_master_slave1:promote zfs-iscsipool1
ip-iscsi0-vlan10 ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40
location l-st-storage-1-prod st-storage-1-prod -inf: storage-1-prod
location l-st-storage-2-prod st-storage-2-prod -inf: storage-2-prod
location lun0-prefer-iscsipool0 iscsi0-target role=Started inf: storage-1-prod
location lun1-prefer-iscsipool0 iscsi0-lun1 role=Started inf: storage-1-prod
location storage-0 { drbd_master_slave0 zfs-iscsipool0 ip-iscsi0-vlan10
ip-iscsi0-vlan20 ip-iscsi0-vlan30 ip-iscsi0-vlan40 iscsi0-target iscsi0-lun0
iscsi0-lun1 } 100: storage-1-prod
location storage-1 { drbd_master_slave1 zfs-iscsipool1 ip-iscsi1-vlan10
ip-iscsi1-vlan20 ip-iscsi1-vlan30 ip-iscsi1-vlan40 } 100: storage-2-prod
colocation storage-target0 inf: ip-iscsi0-vlan10 ip-iscsi0-vlan20
ip-iscsi0-vlan30 ip-iscsi0-vlan40 zfs-iscsipool0 drbd_master_slave0:Master
colocation storage-target1 inf: ip-iscsi1-vlan10 ip-iscsi1-vlan20
ip-iscsi1-vlan30 ip-iscsi1-vlan40 zfs-iscsipool1 drbd_master_slave1:Master
location target0-prefer-iscsipool0 iscsi0-target role=Started inf:
storage-1-prod
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.16-94ff4df \
cluster-infrastructure=corosync \
cluster-name=access_storage \
stonith-enabled=true \
no-quorum-policy=ignore \
default-resource-stickiness=100
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org