On 27.08.2021 11:29, Juergen Gross wrote:
> On 27.08.21 11:01, Jan Beulich wrote:
>> Once in a while, ballooning down Dom0 by about 16G in one go causes:
>>
>> BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 64s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
>>      in-flight: 229:balloon_process
>>      pending: cache_reap
>> workqueue events_freezable_power_: flags=0x84
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: disk_events_workfn
>> workqueue mm_percpu_wq: flags=0x8
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: vmstat_update
>> pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=64s workers=3 idle: 2222 43
>>
>> I've tried to double-check that this isn't related to my IOMMU work
>> in the hypervisor, and I'm pretty sure it isn't. Looking at the
>> function I see it has a cond_resched(), but aiui this won't help
>> with further items queued on the same workqueue.
>>
>> Thoughts?
> 
> I'm seeing two possible solutions here:
> 
> 1. After some time (1 second?) in balloon_process(), re-queue the
>     work item and return (similar to the EAGAIN case, but without
>     increasing the delay); see the sketch below.
> 
> 2. Don't use a workqueue for the ballooning activity; use a kernel
>     thread instead.
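
For concreteness, option 1 might look roughly like the sketch below.
The names (balloon_process(), balloon_worker, balloon_mutex,
current_credit(), BP_DONE) follow drivers/xen/balloon.c; the
one-second budget and the loop structure are assumptions, not an
actual patch:

/* Bound the time spent per work-item invocation and re-queue with
 * zero delay, so other items on the same per-CPU pool (cache_reap,
 * vmstat_update, ...) get a chance to run in between.
 */
static void balloon_process(struct work_struct *work)
{
        unsigned long budget = jiffies + HZ;  /* ~1s, illustrative */
        enum bp_state state = BP_DONE;

        do {
                mutex_lock(&balloon_mutex);
                /* ... adjust the reservation by one batch,
                 * updating state ... */
                mutex_unlock(&balloon_mutex);

                cond_resched();

                if (time_after(jiffies, budget)) {
                        /* Yield this worker; run again right away. */
                        schedule_delayed_work(&balloon_worker, 0);
                        return;
                }
        } while (state == BP_DONE && current_credit());
}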
> 
> I have a slight preference for 2, even if the resulting patch will
> be larger. Option 1 only works around the issue, and it is hard to
> find a really good timeout value.
> 
> I'd be happy to write a patch, but would prefer some feedback on
> which way to go.

Was there a particular reason that a workqueue was used in the first
place? If not, then using a kernel thread would indeed look like the
way to go. The presence of cond_resched() already hints at such an
intention anyway.
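
For concreteness, a rough sketch of what the kernel-thread variant
might look like - all names here (balloon_thread, balloon_thread_wq,
balloon_kick(), balloon_kicked) are illustrative assumptions, not an
actual patch; current_credit() is the existing helper in
drivers/xen/balloon.c:

#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(balloon_thread_wq);
static atomic_t balloon_kicked = ATOMIC_INIT(0);

/* A dedicated thread sleeps until kicked and then balloons in
 * batches.  cond_resched() between batches lets other tasks run,
 * and, unlike a work item, a long-running thread cannot stall
 * other items queued on the same workqueue pool.
 */
static int balloon_thread(void *unused)
{
        for (;;) {
                wait_event_interruptible(balloon_thread_wq,
                                atomic_xchg(&balloon_kicked, 0) ||
                                kthread_should_stop());
                if (kthread_should_stop())
                        return 0;

                while (current_credit()) {
                        /* ... adjust the reservation by one batch ... */
                        cond_resched();
                }
        }
}

/* Callers that used to queue balloon work would instead do: */
static void balloon_kick(void)
{
        atomic_set(&balloon_kicked, 1);
        wake_up(&balloon_thread_wq);
}

/* and at init time: kthread_run(balloon_thread, NULL, "xen-balloon"); */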

Jan

