On Tue, Mar 3, 2026 at 1:29 AM Christian König <[email protected]> wrote:
>
> On 3/2/26 20:35, T.J. Mercier wrote:
> > On Mon, Mar 2, 2026 at 7:51 AM Christian König <[email protected]> wrote:
> >>
> >> On 3/2/26 16:40, Shakeel Butt wrote:
> >>> +TJ
> >>>
> >>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> >>>> On 3/2/26 15:15, Shakeel Butt wrote:
> >>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >>>>>> On 2/24/26 20:28, Dave Airlie wrote:
> >>>>> [...]
> >>>>>>
> >>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
> >>>>>>> fix it; the HPC use case is purely a driver for me doing the work.
> >>>>>>
> >>>>>> Wait a second. How does accounting to cgroups help with that in any
> >>>>>> way?
> >>>>>>
> >>>>>> The last time I looked into this problem the OOM killer worked based
> >>>>>> on the per task_struct stats, which couldn't be influenced this way.
> >>>>>>
> >>>>>
> >>>>> It depends on the context of the oom-killer. If the oom-killer is
> >>>>> triggered due to memcg limits, then only the processes in the scope of
> >>>>> the memcg will be targeted by the oom-killer. With the specific setting
> >>>>> (memory.oom.group), the oom-killer can kill all the processes in the
> >>>>> target memcg.
> >>>>>
> >>>>> However, nowadays the userspace oom-killer is preferred over the kernel
> >>>>> oom-killer due to flexibility and configurability. Userspace oom-killers
> >>>>> like systemd-oomd, Android's LMKD, or fb-oomd are being used in
> >>>>> containerized environments. Such oom-killers look at memcg stats, and
> >>>>> hiding something from memcg, i.e. not charging it to memcg, will hide
> >>>>> such usage from these oom-killers.
> >>>>
> >>>> Well, exactly that's the problem. Android's oom killer is *not* using
> >>>> memcg, exactly because of this inflexibility.
> >>>
> >>> Are you sure Android's oom killer is not using memcg? From what I see in
> >>> the documentation [1], it requires memcg.
> >
> > LMKD used to use memcg v1 for memory.pressure_level, but that has been
> > replaced by PSI, which is now the default configuration. I deprecated
> > all configurations with memcg v1 dependencies in January. We plan to
> > remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
> > reach EOL.
> >
> >> My bad, I should have worded that better.
> >>
> >> The Android OOM killer is not using memcg for tracking GPU memory
> >> allocations, because memcg doesn't have proper support for tracking
> >> shared buffers.
> >>
> >> In other words, GPU memory allocations are shared by design, and it is
> >> the norm that the process which is using a buffer is not the process
> >> which allocated it.
> >>
> >> What we would need (as a start) to handle all of this with memcg would be
> >> to account the resources to the process which referenced them, and not
> >> the one which allocated them.
> >>
> >> I can give a full list of requirements which would be needed by cgroups
> >> to cover all the different use cases, but it basically means tons of
> >> extra complexity.
> >
> > Yeah, this is right. We usually prioritize fast kills rather than
> > picking the biggest offender, though. Application state (foreground /
> > background) is the primary selector; however, LMKD does have a mode
> > (kill_heaviest_task) where it will pick the largest task within a
> > group of apps sharing the same application state. For this it uses RSS
> > from /proc/<pid>/statm, and (prepare to avert your eyes) a new and
> > out-of-tree interface in procfs for accounting dmabufs used by a
> > process. It tracks FD references and map references as they come and
> > go, and only counts any buffer once for a process regardless of the
> > number and type of references the process has to the same buffer. I
> > dislike it greatly.
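(Editor's aside: the statm-based RSS read described above can be sketched in a few lines. This is a rough illustration, not LMKD's actual code; `rss_bytes` is a name invented for the sketch. The second field of /proc/<pid>/statm is the resident page count, which has to be scaled by the page size to get bytes.)

```python
import os

def rss_bytes(pid="self"):
    # /proc/<pid>/statm fields (in pages): size resident shared text lib data dt
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])
    # Scale the resident page count by the system page size to get bytes.
    return resident_pages * os.sysconf("SC_PAGESIZE")

print(rss_bytes())  # RSS of the current process, in bytes
```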
> >
> *sigh* I was really hoping that we would have nailed it with the BPF support
> for DMA-buf and not rely on out of tree stuff any more.
The BPF support is still a win and I'm very happy to have it, but I don't
think there was ever a route to implementing cgroup limits on top of it.

> We should really stop re-inventing the wheel over and over again, fix the
> shortcomings cgroups has instead, and then use that.
>
> > My original intention was to use the dmabuf BPF iterator we added to
> > scan the maps and FDs of a process for dmabufs on demand. Very simple
> > and pretty fast in BPF. This wouldn't support high watermark tracking,
> > so I was forced into doing something else for per-process accounting.
> > To be fair, the HWM tracking has detected a few application bugs where
> > 4GB of system memory was inadvertently consumed by dmabufs.
> >
> > The BPF iterator is currently used to support accounting of buffers
> > not visible in userspace (dmabuf_dump / libdmabufinfo), and it's a nice
> > improvement for that over the old sysfs interface. I hope to replace
> > the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
> > programs that use the dmabuf iterator, but that's not a priority for
> > this year.
> >
> > Independent of all of that, memcg doesn't really work well for this
> > because it's shared memory that can only be attributed to a single
> > memcg, and the most common allocator (gralloc) is in a separate
> > process and memcg from the processes using the buffers (camera,
> > YouTube, etc.). I had a few patches that transferred the ownership of
> > buffers to a new memcg when they were sent via Binder, but this used
> > the memcg v1 charge-moving functionality, which is now gone because it
> > was so complicated. And that only works if there is one user that
> > should be charged for the buffer anyway. What if it is shared by
> > multiple applications and services?
>
> Well, the "usual" approach (e.g. what you find in the literature and what
> other operating systems do) is to use a proportional set size instead of
> the resident set size: https://en.wikipedia.org/wiki/Proportional_set_size
>
> The problem is that a proportional set size is usually harder to come by,
> so it means additional overhead, more complex interfaces, etc.

I added /proc/<pid>/dmabuf_pss as well, which actually isn't a horrible
implementation if you consider that the entire buffer is pinned as long as
there is any user:
https://cs.android.com/android/kernel/superproject/+/common-android14-6.1:common/fs/proc/base.c;drc=1b269f8eb12649ec9370f4051ae049e54a31e3fe;l=3393

With page-based memcg accounting it would be much harder and more expensive.

> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> [1] https://source.android.com/docs/core/perf/lmkd
> >>>
> >>>>
> >>>> See the multiple iterations we already had on that topic, even
> >>>> including reverting already-upstream uAPI.
> >>>>
> >>>> The latest incarnation is that BPF is used for this task on Android.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>
> >
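(Editor's aside: the RSS vs. PSS distinction discussed above can be made concrete with a toy model. This is not kernel code; the buffer list and helper names are invented for the illustration. Under PSS, a shared buffer's size is split evenly among the processes referencing it, so the per-process charges sum to the real memory footprint instead of over-counting shared dmabufs.)

```python
# Each buffer is modeled as (size_in_bytes, set_of_pids_referencing_it).

def rss(buffers, pid):
    """Charge the full size of every buffer the process references."""
    return sum(size for size, users in buffers if pid in users)

def pss(buffers, pid):
    """Charge each buffer proportionally to its number of users."""
    return sum(size / len(users) for size, users in buffers if pid in users)

# Two processes sharing one 4 MiB dmabuf, plus a private 1 MiB buffer each.
MiB = 1 << 20
buffers = [(4 * MiB, {1, 2}), (1 * MiB, {1}), (1 * MiB, {2})]

print(rss(buffers, 1))                     # 5242880   (4 MiB + 1 MiB)
print(pss(buffers, 1))                     # 3145728.0 (4 MiB / 2 + 1 MiB)
print(pss(buffers, 1) + pss(buffers, 2))   # 6291456.0 == total allocated
```

With RSS, each sharer is charged the whole 4 MiB, so the per-process numbers add up to more memory than actually exists; PSS avoids that, at the cost of tracking the user count per buffer.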
