Re: [PATCH] scsi: fix ata_port_wait_eh() hung caused by missing to wake up eh thread

zhengbin (A) Tue, 26 Mar 2019 06:30:51 -0700

On 2019/3/25 19:55, zhengbin (A) wrote:
> On 2019/3/25 18:02, Pavel Tikhomirov wrote:
>> Likely we should do the same for host_eh_scheduled++ as we do for 
>> host_failed++ in scsi_eh_inc_host_failed. Just put it in call_rcu. These 
>> way rcu_read_lock would be enough: if we don't see RECOVERY state in 
>> scsi_dec_host_busy, that means that host_eh_scheduled++ and 
>> corresponding scsi_eh_wakeup not yet happened, and they will handle 
>> wakeup for us.
>>
>> Bart did these rcu-way to make common path faster, so better we do it 
>> without slow mem-barrier.
Bart did these rcu-way, and protecting the host_failed checks with the SCSI 
host lock.
If we use rcu-way, maybe we need to put host_busy & shost_state in host_lock 
too.
I think we should use smp_mb().


Looking forward to reply, thanks.
> If we use rcu for host_eh_scheduled. we should add rcu_head in Scsi_Host,
> replace scsi_schedule_eh(ap->scsi_host) with call_rcu in ata_std_sched_eh, 
> sas_queue_reset??
> This will trigger a kernel hang or oops because of double or more call_rcu() 
> call.
> 
> Any more suggestions?>>
>> On 3/25/19 12:05 PM, zhengbin wrote:
>>> When I use fio test kernel in the following steps:
>>> 1.The sas controller mixes SAS/SATA disks
>>> 2.Use fio test all disks
>>> 3.Simultaneous enable/disable/link_reset/hard_reset PHY
>>>
>>> it will hung in ata_port_wait_eh
>>> Call trace:
>>>   __switch_to+0xb4/0x1b8
>>>   __schedule+0x1e8/0x718
>>>   schedule+0x38/0x90
>>>   ata_port_wait_eh+0x70/0xf8
>>>   sas_ata_wait_eh+0x24/0x30 [libsas]
>>>   transport_sas_phy_reset.isra.3+0x128/0x160 [libsas]
>>>   phy_reset_work+0x20/0x30 [libsas]
>>>   process_one_work+0x1e4/0x460
>>>   worker_thread+0x40/0x450
>>>   kthread+0x12c/0x130
>>>   ret_from_fork+0x10/0x18
>>>
>>> The key code process is like this:
>>> scsi_dec_host_busy
>>>     atomic_dec(&shost->host_busy);
>>>     if (unlikely(scsi_host_in_recovery(shost))) {
>>>             spin_lock_irqsave(shost->host_lock, flags);
>>>             ...
>>>             scsi_eh_wakeup(shost)
>>>             ...
>>>     }
>>>
>>> scsi_schedule_eh
>>>     spin_lock_irqsave(shost->host_lock, flags);
>>>     if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 ||
>>>         scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) {
>>>             ...
>>>             scsi_eh_wakeup(shost);
>>>     }
>>>
>>> scsi_eh_wakeup
>>>     if (scsi_host_busy(shost) == shost->host_failed)
>>>             wake_up_process(shost->ehandler);
>>>
>>> In scsi_dec_host_busy, host_busy & shost_state not in spinlock. Neither
>>> function wakes up the SCSI error handler in the following timing:
>>>
>>> CPU 0(call scsi_dec_host_busy)    CPU 1(call scsi_schedule_eh)
>>> LOAD shost_state(!=recovery)
>>>                                    scsi_host_set_state(SHOST_RECOVERY)
>>>                                    scsi_eh_wakeup(host_busy != host_failed)
>>> atomic_dec(&shost->host_busy);
>>> if (scsi_host_in_recovery(shost))
>>>
>>> Add a smp_mb between host_busy and shost_state.
>>>
>>> Signed-off-by: zhengbin <[email protected]>
>>> ---
>>>   drivers/scsi/scsi_error.c | 7 +++++++
>>>   drivers/scsi/scsi_lib.c   | 5 +++++
>>>   2 files changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>> index 1b8378f..c605a71 100644
>>> --- a/drivers/scsi/scsi_error.c
>>> +++ b/drivers/scsi/scsi_error.c
>>> @@ -88,6 +88,13 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
>>>
>>>     if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 ||
>>>         scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) {
>>> +           /*
>>> +            * We have to order shost_state store above and test of
>>> +            * the host_busy(scsi_eh_wakeup will test it), because
>>> +            * scsi_dec_host_busy accesses these variables without
>>> +            * host_lock.
>>> +            */
>>> +           smp_mb();
>>>             shost->host_eh_scheduled++;
>>>             scsi_eh_wakeup(shost);
>>>     }
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index 601b9f1..9094d20 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -337,6 +337,11 @@ static void scsi_dec_host_busy(struct Scsi_Host *shost)
>>>
>>>     rcu_read_lock();
>>>     atomic_dec(&shost->host_busy);
>>> +   /*
>>> +    * We have to order host_busy dec above and test of the shost_state
>>> +    * below outside the host_lock.
>>> +    */
>>> +   smp_mb();
>>>     if (unlikely(scsi_host_in_recovery(shost))) {
>>>             spin_lock_irqsave(shost->host_lock, flags);
>>>             if (shost->host_failed || shost->host_eh_scheduled)
>>> --
>>> 2.7.4
>>>
>>

Re: [PATCH] scsi: fix ata_port_wait_eh() hung caused by missing to wake up eh thread

Reply via email to