On Tue, Mar 3, 2026 at 1:29 AM Christian König <[email protected]> wrote:
>
> On 3/2/26 20:35, T.J. Mercier wrote:
> > On Mon, Mar 2, 2026 at 7:51 AM Christian König <[email protected]> 
> > wrote:
> >>
> >> On 3/2/26 16:40, Shakeel Butt wrote:
> >>> +TJ
> >>>
> >>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> >>>> On 3/2/26 15:15, Shakeel Butt wrote:
> >>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >>>>>> On 2/24/26 20:28, Dave Airlie wrote:
> >>>>> [...]
> >>>>>>
> >>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
> >>>>>>> fix it; the HPC use case is purely a driver for me doing the work.
> >>>>>>
> >>>>>> Wait a second. How does accounting to cgroups help with that in any 
> >>>>>> way?
> >>>>>>
> >>>>>> The last time I looked into this problem the OOM killer worked based 
> >>>>>> on the per task_struct stats which couldn't be influenced this way.
> >>>>>>
> >>>>>
> >>>>> It depends on the context of the oom-killer. If the oom-killer is
> >>>>> triggered due to memcg limits then only the processes in the scope of
> >>>>> the memcg will be targeted by the oom-killer. With the specific
> >>>>> setting, the oom-killer can kill all the processes in the target memcg.
> >>>>>
> >>>>> However nowadays userspace oom-killers are preferred over the kernel
> >>>>> oom-killer due to their flexibility and configurability. Userspace
> >>>>> oom-killers like systemd-oomd, Android's LMKD or fb-oomd are being used
> >>>>> in containerized environments. Such oom-killers look at memcg stats,
> >>>>> and hiding something from memcg, i.e. not charging it to memcg, will
> >>>>> hide that usage from these oom-killers.
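As a rough illustration of the point above: a userspace oom-killer typically ranks memcgs by the key/value stats the cgroup v2 ABI exposes in memory.stat. The sketch below is not systemd-oomd or LMKD code; the helper name and the "rank by anon + shmem" policy are invented for illustration, only the file format follows the kernel ABI.

```python
# Hypothetical sketch: how a userspace oom-killer might consume the
# cgroup v2 memory.stat key/value format for a target memcg.

def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

# Sample contents in the memory.stat format (values are made up):
sample = """anon 104857600
file 52428800
kernel 8388608
sock 0
shmem 4194304
"""

stats = parse_memory_stat(sample)

# An illustrative ranking policy: weigh anon + shmem, since file pages
# are usually reclaimable. Anything NOT charged to the memcg (e.g. GPU
# buffers) is simply invisible here -- which is the problem being
# discussed in this thread.
ranked_usage = stats["anon"] + stats["shmem"]
```

Anything a driver allocates without charging the memcg never shows up in `stats`, so no policy built on top of this data can see it.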
> >>>>
> >>>> Well exactly that's the problem. Android's oom killer is *not* using 
> >>>> memcg exactly because of this inflexibility.
> >>>
> >>> Are you sure Android's oom killer is not using memcg? From what I see in 
> >>> the
> >>> documentation [1], it requires memcg.
> >
> > LMKD used to use memcg v1 for memory.pressure_level, but that has been
> > replaced by PSI which is now the default configuration. I deprecated
> > all configurations with memcg v1 dependencies in January. We plan to
> > remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
> > reach EOL.
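For context on the PSI default mentioned above: the kernel exposes memory pressure in /proc/pressure/memory in a documented line format. The parser below is only a sketch (the function name is invented); the line format itself is the kernel's PSI ABI.

```python
# Hedged sketch: parsing one line of the PSI format that a PSI-based
# killer like LMKD's default configuration reads from
# /proc/pressure/memory, e.g.:
#   some avg10=0.12 avg60=0.05 avg300=0.01 total=12345

def parse_psi_line(line):
    """Return (kind, values) for one PSI line; kind is 'some' or 'full'."""
    kind, *fields = line.split()
    values = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return kind, values

kind, vals = parse_psi_line(
    "some avg10=1.50 avg60=0.40 avg300=0.10 total=123456")
```

A real killer would poll() the file with a stall threshold written to it rather than parse it in a loop, but the data it acts on looks like this.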
> >
> >> My bad, I should have worded that better.
> >>
> >> The Android OOM killer is not using memcg for tracking GPU memory 
> >> allocations, because memcg doesn't have proper support for tracking shared 
> >> buffers.
> >>
> >> In other words GPU memory allocations are shared by design and it is the 
> >> norm that the process which is using them is not the process which 
> >> allocated them.
> >>
> >> What we would need (as a start) to handle all of this with memcg would be 
> >> to account the resources to the process which references them and not the 
> >> one which allocated them.
> >>
> >> I can give a full list of requirements which would be needed by cgroups to 
> >> cover all the different use cases, but it basically means tons of extra 
> >> complexity.
> >
> > Yeah this is right. We usually prioritize fast kills rather than
> > picking the biggest offender though. Application state (foreground /
> > background) is the primary selector, however LMKD does have a mode
> > (kill_heaviest_task) where it will pick the largest task within a
> > group of apps sharing the same application state. For this it uses RSS
> > from /proc/<pid>/statm, and (prepare to avert your eyes) a new and out
> > of tree interface in procfs for accounting dmabufs used by a process.
> > It tracks FD references and map references as they come and go, and
> > only counts any buffer once for a process regardless of the number and
> > type of references a process has to the same buffer. I dislike it
> > greatly.
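For reference, the RSS part of that selection is cheap: /proc/&lt;pid&gt;/statm is seven space-separated counts in pages, and the second field is the resident set. A minimal sketch, assuming a 4 KiB page size (real code would query sysconf(_SC_PAGESIZE)); the helper name is invented, the statm field layout is the documented procfs format:

```python
# Illustrative only: deriving RSS in bytes from /proc/<pid>/statm,
# roughly what a "pick the heaviest task" policy needs.
# statm fields (all in pages): size resident shared text lib data dt

PAGE_SIZE = 4096  # assumption; use os.sysconf("SC_PAGE_SIZE") in real code

def rss_bytes_from_statm(statm_text, page_size=PAGE_SIZE):
    """Return resident set size in bytes from a statm line."""
    fields = statm_text.split()
    resident_pages = int(fields[1])
    return resident_pages * page_size

# Example statm contents (values are made up):
rss = rss_bytes_from_statm("155629 1253 1231 10 0 155 0")
```

The dmabuf side has no equivalent one-line procfs answer, which is exactly why the out-of-tree per-process dmabuf accounting described above exists.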
>
> *sigh* I was really hoping that we would have nailed it with the BPF support 
> for DMA-buf and not rely on out of tree stuff any more.

The BPF support is still a win and I'm very happy to have it, but I
don't think there was ever a route to implementing cgroup limits on
top of it.

> We should really stop re-inventing the wheel over and over again and fix the 
> shortcomings cgroups has instead and then use that one.
>
> > My original intention was to use the dmabuf BPF iterator we added to
> > scan maps and FDs of a process for dmabufs on demand. Very simple and
> > pretty fast in BPF. This wouldn't support high watermark tracking, so
> > I was forced into doing something else for per-process accounting. To
> > be fair, the HWM tracking has detected a few application bugs where
> > 4GB of system memory was inadvertently consumed by dmabufs.
> >
> > The BPF iterator is currently used to support accounting of buffers
> > not visible in userspace (dmabuf_dump / libdmabufinfo) and it's a nice
> > improvement for that over the old sysfs interface. I hope to replace
> > the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
> > programs that use the dmabuf iterator, but that's not a priority for
> > this year.
> >
> > Independent of all of that, memcg doesn't really work well for this
> > because it's shared memory that can only be attributed to a single
> > memcg, and the most common allocator (gralloc) is in a separate
> > process and memcg than the processes using the buffers (camera,
> > YouTube, etc.). I had a few patches that transferred the ownership of
> > buffers to a new memcg when they were sent via Binder, but this used
> > the memcg v1 charge moving functionality which is now gone because it
> > was so complicated. But that only works if there is one user that
> > should be charged for the buffer anyway. What if it is shared by
> > multiple applications and services?
>
> Well the "usual" (e.g. what you find in the literature and what other 
> operating systems do) approach is to use a proportional set size instead of 
> the resident set size: https://en.wikipedia.org/wiki/Proportional_set_size
>
> The problem is that a proportional set size is usually harder to come by. So 
> it means additional overhead, more complex interfaces etc...

I added /proc/<pid>/dmabuf_pss as well, which actually isn't a
horrible implementation if you consider that the entire buffer is
pinned as long as there is any user.

https://cs.android.com/android/kernel/superproject/+/common-android14-6.1:common/fs/proc/base.c;drc=1b269f8eb12649ec9370f4051ae049e54a31e3fe;l=3393

With page-based memcg accounting it would be much harder and more expensive.

> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> [1] https://source.android.com/docs/core/perf/lmkd
> >>>
> >>>>
> >>>> See the multiple iterations we already had on that topic. Even including 
> >>>> reverting already upstream uAPI.
> >>>>
> >>>> The latest incarnation is that BPF is used for this task on Android.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>
>
