On Jun 27, 2023, at 1:52 PM, Kurt Miller <[email protected]> wrote:
> 
> On Jun 14, 2023, at 12:51 PM, Vitaliy Makkoveev <[email protected]> wrote:
>> 
>> On Tue, May 30, 2023 at 01:31:08PM +0200, Martin Pieuchot wrote:
>>> So it seems the java process is holding the `sysctl_lock' for too long
>>> and blocks all other sysctl(2) calls.  This seems wrong to me.  We should
>>> come up with a clever way to prevent vslocking too much memory.  A single
>>> lock obviously doesn't fly with that many CPUs.
>>> 
>> 
>> We vslock memory to prevent a context switch while doing copyin() and
>> copyout(), right? This is required to avoid context switches within the
>> foreach loops over kernel-lock-protected lists. But it does not seem to be
>> required for simple sysctl_int() calls or rwlock-protected data. So the
>> sysctl_lock acquisition and the uvm_vslock() calls could be avoided for a
>> significant number of mibs and pushed deeper down for the rest.
> 
> I’m back on -current testing and have some additional findings that
> may help a bit. The memory leak fix had no effect on this issue. -current
> behavior is as I previously described. When java trips the issue, it
> goes into a state where many threads are all running at 100% cpu but
> it makes no forward progress. I’m going to call this the run-away state.
> Java is calling sched_yield(2) when in this state.
> 
> When java is in the run-away state, a different process can trip
> the next stage, where processes block waiting on sysctllk indefinitely.
> Top with process arguments is one trigger; pgrep and ps -axl also trip this.
> In my last test on -current, java was stuck in the run-away state for 7 hours
> 45 minutes before cron daily ran and caused the lockups.
> 
> I did a test with -current + locking sched_yield() back up with the
> kernel lock. The behavior changed slightly. Java still enters the run-away
> state occasionally but eventually does make forward progress and
> complete. When java is in the run-away state the sysctllk issue can still
> be tripped, but if it is not tripped java eventually completes. Across
> about 200 invocations of a java command that usually takes 50 seconds,
> java entered the run-away state 4 times but eventually completed:
> 
> Typically it runs like this:
>    0m51.16s real     5m09.37s user     0m49.96s system
> 
> The exceptions look like this:
>    1m11.15s real     5m35.88s user    13m20.47s system
>   27m18.93s real    31m13.19s user   754m48.41s system
>   13m44.44s real    19m56.11s user   501m39.73s system
>   19m23.72s real    24m40.97s user   629m08.16s system
> 
> Testing -current with dumbsched.3 behaves the same as -current described
> above.
> 
> One other thing I observed so far is what happens when egdb is
> attached to the run-away java process. egdb stops the process
> using ptrace(2) PT_ATTACH. Now if I issue a command that would
> typically lock up the system, like top displaying command line
> arguments, the system does not lock up. I think this rules out
> the theory that kernel memory is fragmented.
> 
> Switching CPUs in ddb tends to lock up ddb, so I have limited
> info, but here is what I have from the -current lockup and the
> -current with dumbsched.3 lockup.

Another data point to support the idea of a missing wakeup: when
java is in the run-away state, if I send SIGSTOP followed by SIGCONT,
it dislodges java from the run-away state and it returns to normal operation.