Hello, We've been struggling with some major performance issues related to autovacuum. (I know this is a common problem.) For a while, we believed this was primarily related to I/O contention, but recent observations from our monitoring make me wonder if there's a deeper issue here, so I'm looking for some better understanding.
>From reading the 9.6.3 source, it looks like the autovacuum process itself is single-threaded, and it reads pages essentially linearly from the relation (possibly skipping some). When the autovacuum process needs to modify a page, it doesn't write it directly, but rather marks the buffer dirty. The page will be written later, either by the checkpointer (according to its configuration, in terms of time and WAL), the bgwriter (according to its configuration, in terms of write count and sleep time), or else some other backend process that requires a free buffer (if the count of non-dirty buffers reaches zero). Is this accurate? In our case, we found that when autovacuum runs, the number of dirty buffers written by regular backends shoots from a handful to as much as 300 buffers per second. (We have 200 backends on most databases.) More specifically: prior to recent autovacuums starting, databases did under 1000 buffer allocations per second. Most of those were done by the checkpointer. Individual backends did just a handful. After autovacuum started, buffer allocations rose to between 2,000 and 8,000 per second. The checkpointer handled many of these, but so did individual backends (see above). The bgwriter peaked around 25 buffer writes per second. So it seems like the spike in buffers written by normal backends could explain the significant degradation that we see in average latency and overall throughput (which can be as much as 50%). It looks to me like the autovacuum process is effectively generating work (in the form of async writes) that's being distributed implicitly to the various backend processes, creating latency for any other query that happens to require a buffer (including read-only queries). Maybe we can improve this by tuning the bgwriter. But if that's single-threaded, presumably there's an upper bound to how many buffer writes it can handle? Is there anything to guarantee that the bgwriter will keep up with the work that's being generated? That is, for some workloads, wouldn't it be possible that autovacuum could always generate work faster than the bgwriter can do it, and you'd always have some pretty major degradation to all other queries? I've drawn a lot of inferences here, and I'm sure there are some mistakes. I'd appreciate any clarification, correction, or confirmation! Thanks in advance, Dave