On Wed, Mar 25, 2026 at 12:40 AM Masahiko Sawada <[email protected]> wrote:
>
> Hi Lukas,
>
> On Sat, Feb 28, 2026 at 3:59 PM Lukas Fittl <[email protected]> wrote:
> >
> > Hi,
> >
> > See attached a patch that implements a new function,
> > pg_buffercache_relation_stats(), which returns per-relfilenode
> > statistics on the number of buffers, how many are dirtied/pinned, and
> > their avg usage count.
>
> Thank you for the proposal!
>
> Paul A Jungwirth, Khoa Nguyen, and I reviewed this patch through the
> Patch Review Workshop, and I'd like to share our comments.
>
> > This can be used in monitoring scripts to know which relations are
> > kept in shared buffers, to better understand performance issues that
> > occur due to relations getting evicted from the cache. In our own
> > monitoring tool (pganalyze) we've offered functionality like this
> > based on the existing pg_buffercache() function for a bit over a year
> > now [0], and people have found it very valuable - but it doesn't
> > work for larger database servers.
> >
> > Specifically, performing a query that gets this information can be
> > prohibitively expensive when using large shared_buffers, and even on
> > the default 128MB shared buffers there is a measurable difference:
> >
> > postgres=# WITH pg_buffercache_relation_stats AS (
> >     SELECT relfilenode, reltablespace, reldatabase, relforknumber,
> >         COUNT(*) AS buffers,
> >         COUNT(*) FILTER (WHERE isdirty) AS buffers_dirty,
> >         COUNT(*) FILTER (WHERE pinning_backends > 0) AS buffers_pinned,
> >         AVG(usagecount) AS usagecount_avg
> >     FROM pg_buffercache
> >     WHERE reldatabase IS NOT NULL
> >     GROUP BY 1, 2, 3, 4
> > )
> > SELECT * FROM pg_buffercache_relation_stats WHERE relfilenode = 2659;
> >
> >  relfilenode | reltablespace | reldatabase | relforknumber | buffers | buffers_dirty | buffers_pinned |   usagecount_avg
> > -------------+---------------+-------------+---------------+---------+---------------+----------------+--------------------
> >         2659 |          1663 |           5 |             0 |       8 |             0 |              0 | 5.0000000000000000
> >         2659 |          1663 |           1 |             0 |       7 |             0 |              0 | 5.0000000000000000
> >         2659 |          1663 |      229553 |             0 |       7 |             0 |              0 | 5.0000000000000000
> > (3 rows)
> >
> > Time: 20.991 ms
> >
> > postgres=# SELECT * FROM pg_buffercache_relation_stats() WHERE relfilenode = 2659;
> >  relfilenode | reltablespace | reldatabase | relforknumber | buffers | buffers_dirty | buffers_pinned | usagecount_avg
> > -------------+---------------+-------------+---------------+---------+---------------+----------------+----------------
> >         2659 |          1663 |           1 |             0 |       7 |             0 |              0 |              5
> >         2659 |          1663 |      229553 |             0 |       7 |             0 |              0 |              5
> >         2659 |          1663 |           5 |             0 |       8 |             0 |              0 |              5
> > (3 rows)
> >
> > Time: 2.912 ms
> >
> > With the new function this aggregation is done before putting the data
> > in the tuplestore used for the set-returning function.
>
> Overall, we find the proposed feature useful. The proposed way is much
> cheaper, especially when the number of per-relation stats is not large.
>
> Here are review comments on the v1 patch:
>
> ---
> -	pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql
> +	pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql \
> +	pg_buffercache--1.7--1.8.sql
>
> Since commit 4b203d499c6 bumped the version from 1.6 to 1.7 last
> November, we think we don't need to bump the version again for this
> new feature.
>
> ---
> +/*
> + * Hash key for pg_buffercache_relation_stats — groups by relation identity.
> + */
> +typedef struct
> +{
> +	RelFileNumber	relfilenumber;
> +	Oid		reltablespace;
> +	Oid		reldatabase;
> +	ForkNumber	forknum;
> +} BufferRelStatsKey;
> +
> +/*
> + * Hash entry for pg_buffercache_relation_stats — accumulates per-relation
> + * buffer statistics.
> + */
> +typedef struct
> +{
> +	BufferRelStatsKey	key;	/* must be first */
> +	int32	buffers;
> +	int32	buffers_dirty;
> +	int32	buffers_pinned;
> +	int64	usagecount_total;
> +} BufferRelStatsEntry;
>
> Can we move these typedefs above the function prototypes, as the other
> typedefs are defined there?
>
> ---
> +	relstats_hash = hash_create("pg_buffercache relation stats",
> +								128,
> +								&hash_ctl,
> +								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
>
> It might be worth considering simplehash.h for even better performance.
>
> ---
> +	while ((entry = (BufferRelStatsEntry *) hash_seq_search(&hash_seq)) != NULL)
> +	{
> +		if (entry->buffers == 0)
> +			continue;
> +
>
> We might want to put CHECK_FOR_INTERRUPTS() here too, as the number of
> entries can be as many as NBuffers in principle.
>
> ---
> We've discussed that there might be room for improvement in the
> function name. For example, pg_buffercache_relations might be a better
> name than pg_buffercache_relation_stats, since everything in this
> module is stats. If we drop "_stats", then "relation" should be
> plural, to match the other functions in the module ("pages",
> "os_pages", "numa_pages", "usage_counts").
I know we already have a couple of hand-aggregation functions, but I am hesitant to add more of these. The question is: where do we stop? For example, the current function is useless if someone wants to find the parts of a relation which are hot, since it doesn't include page numbers. Do we write another function for that? Or do we add page numbers to this function, at which point there's hardly any aggregation happening? What if somebody wanted to perform an aggregation more complex than count(), like the average number of buffers per relation or the distribution of a relation's buffers in the cache - do they write separate functions?

Another problem is the maintenance cost these functions bring. For example, with the resizable shared buffers project we would have yet another function to stress test.

Looking at the function, I see it uses a hash table to aggregate the data. To some extent it's duplicating functionality we already have: aggregation using hashing. Are we going to duplicate that functionality everywhere we require aggregation on top of a system function? These functions will then miss any optimizations we make to hash aggregation in the future.

Can we instead investigate why the aggregation on top of the pg_buffercache output takes so much more time than doing it in the function, and fix that as much as we can? I know some slowness will come from the tuplestore APIs and from tuple formation and deformation, but I wouldn't expect it to be 10 times slower.

--
Best Wishes,
Ashutosh Bapat
