Hi,

A brief update on this issue:

On 2026-03-26 11:18, Christian Kastner wrote:
>> On 2026-03-24 20:07, Paul Gevers wrote:
>>> I started a run on the amd64 host and witnessed that when the test was
>>> at the state as in the log, it seemed to be really not doing anything.
>>> Here the output of $(top)
>>
>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ 
>>> COMMAND
>>>    1021 debci     20   0  700076 250676   6508 S   0.0   0.1 123:12.61 
>>> test-backend-op
>>
>> Well, that makes it even more odd. The S state suggests this is blocking
>> on something which is surprising, given what this test does.

Paul was kind enough to send me some debug output from the CI worker.
These were the contents of /proc/<pid>/stack:

    [<0>] futex_wait_queue+0x68/0x90
    [<0>] __futex_wait+0x151/0x1c0
    [<0>] futex_wait+0x79/0x120
    [<0>] do_futex+0xcb/0x190
    [<0>] __x64_sys_futex+0x127/0x1e0
    [<0>] do_syscall_64+0x82/0x190
    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

The `do_syscall_64` frame is just the generic 64-bit syscall entry path
(not syscall number 64); the `futex_wait` frames show the process blocked
in futex(2), so the above indicates some kind of deadlock.
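As an aside, when chasing a thread-pool deadlock like this, the per-thread
view is often more telling than the main process stack. Reading kernel
stacks from /proc/<pid>/task/*/stack needs root, but the world-readable
wchan file names the kernel function each thread sleeps in. A minimal
sketch, assuming a Linux /proc filesystem (the function name is mine, not
part of any tool):

```python
import os

def thread_wait_channels(pid: int) -> dict[str, str]:
    """Map each thread ID of `pid` to the kernel function it sleeps in.

    /proc/<pid>/task/<tid>/wchan reads "0" (or empty) for a runnable
    thread; a futex deadlock would show e.g. "futex_wait_queue" on
    every stuck worker thread.
    """
    chans = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        with open(f"/proc/{pid}/task/{tid}/wchan") as f:
            name = f.read().strip()
        chans[tid] = name if name and name != "0" else "running"
    return chans

# Inspect ourselves as a demo; substitute the stuck test-backend-op PID.
print(thread_wait_channels(os.getpid()))
```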


After a bit of research, the suspicion was that this deadlock occurs
because ggml

  (1) reads the total number of available threads from the CPU
      parameters and sizes its thread pool accordingly, but fewer are
      actually available in the LXC testbed,

  (2) and/or at least one thread is somehow being killed while being
      waited on,

  (3) and/or there is a plain programming error leading to a deadlock.
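Hypothesis (1) is easy to illustrate outside of ggml: on Linux, the CPU
count a process reads from the hardware and the set of CPUs it may
actually be scheduled on can differ inside a container. A sketch of the
mismatch (this is an illustration only, not ggml's actual sizing logic):

```python
import os

# What a naive pool sizer might read: all CPUs the kernel knows about.
hw_threads = os.cpu_count()

# Versus what this process is actually allowed to run on, which an
# LXC testbed can restrict via the affinity mask / cpuset cgroup.
usable_threads = len(os.sched_getaffinity(0))

# A pool sized from hw_threads while only usable_threads can ever be
# scheduled is the kind of mismatch hypothesis (1) describes; sizing
# from the affinity set avoids it.
pool_size = min(hw_threads, usable_threads)
print(hw_threads, usable_threads, pool_size)
```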


A test upload of ggml to experimental that intentionally limits threads
to a maximum of 8 resolved the issue, which gives (1) some weight.

However, the same upload also dumped some system info, which showed 48
threads available in the cgroup, exactly what the CPU in question
supports (24C/48T). That speaks against (1).
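For reference, on a cgroup v2 system the bandwidth side of "threads
available in the cgroup" comes from the cpu.max file, which holds
"<quota> <period>" in microseconds, or "max <period>" when unlimited. A
small hypothetical helper (the function name is mine) showing how an
effective CPU count would be derived from it:

```python
def effective_cpus(cpu_max: str, online_cpus: int) -> float:
    """Derive an effective CPU count from a cgroup v2 cpu.max line."""
    quota, period = cpu_max.split()
    if quota == "max":
        # No bandwidth limit: all online CPUs are available.
        return float(online_cpus)
    return int(quota) / int(period)

# No quota on the 24C/48T worker: all 48 hardware threads available.
print(effective_cpus("max 100000", 48))     # -> 48.0
# A quota of 4 CPUs' worth of time would cap the effective count at 4.
print(effective_cpus("400000 100000", 48))  # -> 4.0
```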


At least I think I now have a better chance of reproducing this, using
an appropriately sized cloud worker.

Best,
Christian
