On 3/2/26 20:35, T.J. Mercier wrote:
> On Mon, Mar 2, 2026 at 7:51 AM Christian König <[email protected]> 
> wrote:
>>
>> On 3/2/26 16:40, Shakeel Butt wrote:
>>> +TJ
>>>
>>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>>>> On 3/2/26 15:15, Shakeel Butt wrote:
>>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>>>> [...]
>>>>>>
>>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>>>> fix it; the HPC use case is purely a driver for me doing the work.
>>>>>>
>>>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>>>
>>>>>> The last time I looked into this problem the OOM killer worked based on 
>>>>>> the per task_struct stats which couldn't be influenced this way.
>>>>>>
>>>>>
>>>>> It depends on the context of the oom-killer. If the oom-killer is
>>>>> triggered due to memcg limits then only the processes in the scope of
>>>>> that memcg will be targeted by the oom-killer. With the appropriate
>>>>> setting (memory.oom.group), the oom-killer can kill all the processes
>>>>> in the target memcg.
>>>>>
>>>>> However, nowadays userspace oom-killers are preferred over the kernel
>>>>> oom-killer due to their flexibility and configurability. Userspace
>>>>> oom-killers like systemd-oomd, Android's LMKD or fb-oomd are being used
>>>>> in containerized environments. Such oom-killers look at memcg stats, so
>>>>> hiding something from memcg, i.e. not charging it to memcg, will hide
>>>>> that usage from them.
>>>>
>>>> Well exactly that's the problem. Android's oom killer is *not* using memcg 
>>>> exactly because of this inflexibility.
>>>
>>> Are you sure Android's oom killer is not using memcg? From what I see in the
>>> documentation [1], it requires memcg.
> 
> LMKD used to use memcg v1 for memory.pressure_level, but that has been
> replaced by PSI which is now the default configuration. I deprecated
> all configurations with memcg v1 dependencies in January. We plan to
> remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
> reach EOL.
> 
>> My bad, I should have worded that better.
>>
>> The Android OOM killer is not using memcg for tracking GPU memory 
>> allocations, because memcg doesn't have proper support for tracking shared 
>> buffers.
>>
>> In other words, GPU memory allocations are shared by design, and it is the 
>> norm that the process using a buffer is not the process which allocated it.
>>
>> What we would need (as a start) to handle all of this with memcg would be to 
>> account the resources to the process which references them, not the one 
>> which allocated them.
>>
>> I can give a full list of requirements which would be needed by cgroups to 
>> cover all the different use cases, but it basically means tons of extra 
>> complexity.
> 
> Yeah this is right. We usually prioritize fast kills rather than
> picking the biggest offender though. Application state (foreground /
> background) is the primary selector, however LMKD does have a mode
> (kill_heaviest_task) where it will pick the largest task within a
> group of apps sharing the same application state. For this it uses RSS
> from /proc/<pid>/statm, and (prepare to avert your eyes) a new and out
> of tree interface in procfs for accounting dmabufs used by a process.
> It tracks FD references and map references as they come and go, and
> only counts any buffer once for a process regardless of the number and
> type of references a process has to the same buffer. I dislike it
> greatly.

*sigh* I was really hoping that we had nailed it with the BPF support for 
DMA-buf and would no longer rely on out-of-tree stuff.

We should really stop re-inventing the wheel over and over again, fix the 
shortcomings cgroups has instead, and then use that.

> My original intention was to use the dmabuf BPF iterator we added to
> scan maps and FDs of a process for dmabufs on demand. Very simple and
> pretty fast in BPF. This wouldn't support high watermark tracking, so
> I was forced into doing something else for per-process accounting. To
> be fair, the HWM tracking has detected a few application bugs where
> 4GB of system memory was inadvertently consumed by dmabufs.
> 
> The BPF iterator is currently used to support accounting of buffers
> not visible in userspace (dmabuf_dump / libdmabufinfo) and it's a nice
> improvement for that over the old sysfs interface. I hope to replace
> the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
> programs that use the dmabuf iterator, but that's not a priority for
> this year.
> 
> Independent of all of that, memcg doesn't really work well for this
> because it's shared memory that can only be attributed to a single
> memcg, and the most common allocator (gralloc) is in a separate
> process and memcg than the processes using the buffers (camera,
> YouTube, etc.). I had a few patches that transferred the ownership of
> buffers to a new memcg when they were sent via Binder, but this used
> the memcg v1 charge moving functionality which is now gone because it
> was so complicated. But that only works if there is one user that
> should be charged for the buffer anyway. What if it is shared by
> multiple applications and services?

Well, the "usual" approach (i.e. what you find in the literature and what other 
operating systems do) is to use a proportional set size instead of the resident 
set size: https://en.wikipedia.org/wiki/Proportional_set_size

The problem is that a proportional set size is usually harder to come by, so it 
means additional overhead, more complex interfaces, etc.

Regards,
Christian.

> 
>> Regards,
>> Christian.
>>
>>>
>>> [1] https://source.android.com/docs/core/perf/lmkd
>>>
>>>>
>>>> See the multiple iterations we already had on that topic, even including 
>>>> reverting uAPI that was already upstream.
>>>>
>>>> The latest incarnation is that BPF is used for this task on Android.
>>>>
>>>> Regards,
>>>> Christian.
>>
