On Wed, Nov 7, 2018 at 12:16 AM Rishi <[email protected]> wrote:

>
>
> On Tue, Nov 6, 2018 at 10:41 PM Rishi <[email protected]> wrote:
>
>>
>>
>> On Tue, Nov 6, 2018 at 5:47 PM Wei Liu <[email protected]> wrote:
>>
>>> On Tue, Nov 06, 2018 at 03:31:31PM +0530, Rishi wrote:
>>> >
>>> > So after looking at the stack trace, it appears that the CPU was
>>> > getting stuck in xen_hypercall_xen_version.
>>>
>>> That hypercall is used when a PV kernel (re-)enables interrupts. See
>>> xen_irq_enable. The purpose is to force the kernel to switch into the
>>> hypervisor.
>>>
>>> >
>>> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
>>> >
>>> > [30569.582740] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
>>> > [30569.588186] Kernel panic - not syncing: softlockup: hung tasks
>>> > [30569.591307] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L    4.19.1 #1
>>> > [30569.595110] Hardware name: Xen HVM domU, BIOS 4.4.1-xs132257 12/12/2016
>>> > [30569.598356] Call Trace:
>>> > [30569.599597]  <IRQ>
>>> > [30569.600920]  dump_stack+0x5a/0x73
>>> > [30569.602998]  panic+0xe8/0x249
>>> > [30569.604806]  watchdog_timer_fn+0x200/0x230
>>> > [30569.607029]  ? softlockup_fn+0x40/0x40
>>> > [30569.609246]  __hrtimer_run_queues+0x133/0x270
>>> > [30569.611712]  hrtimer_interrupt+0xfb/0x260
>>> > [30569.613800]  xen_timer_interrupt+0x1b/0x30
>>> > [30569.616972]  __handle_irq_event_percpu+0x69/0x1a0
>>> > [30569.619831]  handle_irq_event_percpu+0x30/0x70
>>> > [30569.622382]  handle_percpu_irq+0x34/0x50
>>> > [30569.625048]  generic_handle_irq+0x1e/0x30
>>> > [30569.627216]  __evtchn_fifo_handle_events+0x163/0x1a0
>>> > [30569.629955]  __xen_evtchn_do_upcall+0x41/0x70
>>> > [30569.632612]  xen_evtchn_do_upcall+0x27/0x50
>>> > [30569.635136]  xen_do_hypervisor_callback+0x29/0x40
>>> > [30569.638181] RIP: e030:xen_hypercall_xen_version+0xa/0x20
>>>
>>> What is the asm code for this RIP?
>>>
>>>
>>> Wei.
>>>
>>
>> The crash gets resolved by appending "noirqbalance" to the Xen command
>> line. This way all dom0 CPUs are still available, but Xen no longer
>> balances IRQs across them.
>>
>> Even though I'm running the irqbalance service in dom0, the IRQs do not
>> seem to be moving. <- this is from the dom0 perspective; I do not yet
>> know whether it follows the Xen IRQs.
>>
>> I tried objdump; the function is present in the output, but there is no
>> asm code for it, just "...":
>>
>> ffffffff81001220 <xen_hypercall_xen_version>:
>>         ...
>>
>> ffffffff81001240 <xen_hypercall_console_io>:
>>         ...
>>
>> All the hypercall stubs appear like this.
>>
>
> How frequently can that hypercall / xen_irq_enable() fire? Many times
> per second, or only once in a while?
> During my tests the system runs stably unless I'm downloading a large
> file. Files of around a GB in size download without a crash, but the
> system crashes once the file is larger than that. I'm using wget to
> download a 2.1GB file.
>
> Is there a way I can simulate a PV kernel (re-)enabling interrupts from
> a kernel module in a controlled fashion?
>

If this is on the right track:

ffffffff8101ab70 <xen_force_evtchn_callback>:
ffffffff8101ab70:       31 ff                   xor    %edi,%edi
ffffffff8101ab72:       31 f6                   xor    %esi,%esi
ffffffff8101ab74:       e8 a7 66 fe ff          callq  ffffffff81001220 <xen_hypercall_xen_version>
ffffffff8101ab79:       c3                      retq
ffffffff8101ab7a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
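
That matches the C side, going by my reading of arch/x86/xen/irq.c in the
4.19 tree (treat this as a sketch rather than a verbatim quote): the function
issues a no-op xen_version hypercall with both arguments zero, which forces
the guest to trap into Xen so that any pending event channel upcall gets
delivered.

/* Sketch: the dummy hypercall exists only to enter the hypervisor; the
 * 0/NULL arguments match the xor %edi,%edi / xor %esi,%esi above. */
void xen_force_evtchn_callback(void)
{
        (void)HYPERVISOR_xen_version(0, NULL);
}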

It seems I'm hitting the following code from xen_irq_enable:

        barrier(); /* unmask then check (avoid races) */
        if (unlikely(vcpu->evtchn_upcall_pending))
                xen_force_evtchn_callback();
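
For context, the surrounding function (again only my paraphrase of
arch/x86/xen/irq.c, so a sketch) clears the per-vCPU upcall mask first and
only then checks whether Xen has already queued an event:

/* Sketch of xen_irq_enable(): unmask upcalls for this vCPU, then, if Xen
 * marked an event pending while we were masked, force the callback so it
 * is delivered immediately. */
void xen_irq_enable(void)
{
        struct vcpu_info *vcpu;

        preempt_disable();              /* stay on this vCPU's vcpu_info */

        vcpu = this_cpu_read(xen_vcpu);
        vcpu->evtchn_upcall_mask = 0;

        barrier(); /* unmask then check (avoid races) */
        if (unlikely(vcpu->evtchn_upcall_pending))
                xen_force_evtchn_callback();

        preempt_enable();
}

If I read this right, taking the "unlikely" branch under heavy I/O is not
surprising by itself; it just means an event arrived while upcalls were
masked. The puzzle is why the CPU then appears to stay stuck in the
hypercall.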

The code says "unlikely", yet it is being hit. I also found the following
structure:

struct vcpu_info {
        /*
         * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
         * a pending notification for a particular VCPU. It is then cleared
         * by the guest OS /before/ checking for pending work, thus avoiding
         * a set-and-check race. Note that the mask is only accessed by Xen
         * on the CPU that is currently hosting the VCPU. This means that the
         * pending and mask flags can be updated by the guest without special
         * synchronisation (i.e., no need for the x86 LOCK prefix).

Let me know if I'm spamming the thread with these intermediate findings.
_______________________________________________
Xen-devel mailing list
[email protected]
https://lists.xenproject.org/mailman/listinfo/xen-devel
