> And then try again (until ?).

Until the LRU is empty.
You have one LRU per domain, so when a buffer is evicted from VRAM it
is moved to the GTT domain and also removed from the VRAM domain's LRU.
As long as no other task is trying to do a CS (command submission), the
LRU will sooner or later become empty.
One possible explanation for what happens here is that another
process/thread is moving buffers back in while the first process is
trying to evict them.
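
The argument above can be illustrated with a small user-space model. This is only a sketch under loud assumptions: the struct and every toy_* function are hypothetical, they track LRU membership as plain counters, and they are not the TTM implementation. It shows that draining the VRAM LRU terminates when nobody interferes, and never finishes (here cut off by max_steps) when a competing task moves one buffer back per eviction.

```c
#include <assert.h>

/* Toy model, NOT the real TTM data structures: one LRU per domain,
 * represented only by the number of buffers on it. */
struct toy_lrus {
    int vram; /* buffers on the VRAM LRU */
    int gtt;  /* buffers on the GTT LRU */
};

/* Evicting a buffer takes it off the VRAM LRU and puts it in GTT. */
static void toy_evict_one(struct toy_lrus *l)
{
    assert(l->vram > 0);
    l->vram--;
    l->gtt++;
}

/* Hypothetical competing task: moves up to n buffers back into VRAM. */
static void toy_move_back(struct toy_lrus *l, int n)
{
    while (n-- > 0 && l->gtt > 0) {
        l->gtt--;
        l->vram++;
    }
}

/* Returns the number of evictions needed to empty the VRAM LRU, or -1
 * if it is still non-empty after max_steps evictions. */
static int toy_drain_vram(struct toy_lrus *l, int moved_back_per_step,
                          int max_steps)
{
    int steps = 0;

    while (l->vram > 0) {
        if (steps == max_steps)
            return -1;
        toy_evict_one(l);
        toy_move_back(l, moved_back_per_step);
        steps++;
    }
    return steps;
}
```

With five buffers on the VRAM LRU and no interference, toy_drain_vram() finishes after five evictions. With one buffer moved back per step it never makes progress, which is the situation described above.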
Regards,
Christian.
On 14.03.2017 at 17:31, Julien Isorce wrote:
Hello,
While debugging a softlock that happens on an ioctl(RADEON_CS), I
found that it keeps looping indefinitely in the following loop:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819
It would be great if someone could explain the logic behind this
loop. My understanding is that it tries to get a free node in which
to place the current buffer object by calling "ttm_bo_man_get_node".
If that fails with mem->mm_node as NULL (internally -ENOSPC), it tries
to evict another buffer from the LRU by calling "ttm_mem_evict_first",
and then tries again (until ?).
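
The loop as understood above can be sketched in user space as follows. This is a simplified model, not the code in ttm_bo.c: the struct and the toy_* functions are hypothetical stand-ins, sizes are counted in abstract pages, and each evicted buffer is assumed to free exactly one page.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one memory domain, NOT the real TTM structures. */
struct toy_domain {
    int free_pages; /* space currently available for placement */
    int lru_len;    /* evictable buffers on this domain's LRU */
};

/* Stand-in for the get_node step: true means a node was found
 * (mm_node != NULL in the description above). */
static bool toy_get_node(struct toy_domain *d, int need)
{
    if (d->free_pages < need)
        return false;
    d->free_pages -= need;
    return true;
}

/* Stand-in for the evict-first step: evicts one buffer from the LRU,
 * or returns an error when the LRU is empty. */
static int toy_evict_first(struct toy_domain *d)
{
    if (d->lru_len == 0)
        return -1;
    d->lru_len--;
    d->free_pages++; /* assumption: each buffer is one page */
    return 0;
}

/* Try to place, evict on failure, retry. Returns 0 on success and -1
 * once nothing is left to evict. */
static int toy_force_space(struct toy_domain *d, int need)
{
    assert(need > 0);
    for (;;) {
        if (toy_get_node(d, need))
            return 0;
        if (toy_evict_first(d) != 0)
            return -1;
    }
}
```

In this model the loop always terminates, because every pass either places the buffer or shortens the LRU. It can only spin forever if eviction keeps reporting success while neither freeing space nor shrinking the LRU.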
For some reason, at some point while running an app that uploads a lot
of images through GL, these 2 functions keep returning 0 with
mem->mm_node as NULL, so the "while (true)" keeps looping indefinitely.
As a result, the process is stuck in that ioctl forever.
A nasty workaround is to break out of the loop once the number of
iterations exceeds a threshold. It looks like it very rarely goes over
200, so breaking after more than 200 iterations and returning -ENOMEM
lets the application regain control instead of being stuck. This is
quite helpful for the debugging phase but definitely not a proper fix.
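
For illustration, the bounded-retry workaround could look roughly like the user-space sketch below. It is not the actual kernel change: toy_bounded_retry, the callback type, the two helper callbacks and TOY_MAX_ITERATIONS are hypothetical names, and the callback stands in for one "get node, else evict" pass of the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_ITERATIONS 200 /* threshold mentioned above */

/* One pass of the loop: returns 1 if space was found, 0 to retry. */
typedef int (*toy_try_fn)(void *ctx);

/* Models the stuck case: every pass "succeeds" without finding space. */
static int toy_never_finds_space(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Models the normal case: space is found after *ctx failed passes. */
static int toy_finds_space_eventually(void *ctx)
{
    int *remaining = ctx;

    if (*remaining == 0)
        return 1;
    (*remaining)--;
    return 0;
}

/* Retry until space is found, but give up with -ENOMEM after more than
 * TOY_MAX_ITERATIONS failed passes. *iterations counts failed passes. */
static int toy_bounded_retry(toy_try_fn try_once, void *ctx, int *iterations)
{
    assert(try_once != NULL && iterations != NULL);
    *iterations = 0;
    for (;;) {
        if (try_once(ctx))
            return 0;
        if (++(*iterations) > TOY_MAX_ITERATIONS)
            return -ENOMEM;
    }
}
```

The cap only turns the hang into an error return for the caller. It does not explain why the passes stop making progress.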
A colleague found that replacing ttm_bo_unreserve with
__ttm_bo_unreserve here
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L751
fixes this softlock, because the latter does not re-add the evicted
buffer to the LRU.
But we are unsure whether this is a proper fix or just a workaround,
given that this line has existed since the first TTM commit in 2009.
Any comments?
Also it looks like there is a recursion from:
radeon_cs_ioctl
  radeon_cs_parser_relocs
    radeon_bo_list_validate
      ttm_bo_validate
        ttm_bo_move_buffer
          ttm_bo_mem_space @
            ttm_bo_mem_force_space
              ttm_mem_evict_first
                ttm_bo_evict
                  ttm_bo_mem_space @
                    ttm_mem_evict_first
                      ...
It looks like it is meant to work this way, but this makes it
complicated to follow. So any input would be much appreciated,
especially about the eviction mechanism + bo->evicted flag and how TTM
manages the LRU in corner cases such as when the VRAM is full.
I tried kernel 4.4, 4.8 and git HEAD from last week.
Thx
Julien
_______________________________________________
dri-devel mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/dri-devel