Re: many sessions waiting DataFileRead and extend
Postgres lives as a process in linux, and keeps its own cache, and tries to use that as much as possible for data. This is postgres shared buffers, commonly called the buffer cache.

For WAL, sessions write to the wal buffer (separate from the postgres buffer cache), and need to write it to disk upon commit with default settings. Please mind wal writes are commonly performed by the wal writer, but can be done by backends too.

(slightly simplified) When postgres needs to read data for processing, it will look for it in the postgres buffer cache and use it from the cache if it can find the requested buffer; otherwise it will have to read it from the operating system. When changes are made to data or metadata (such as hint bits being set), the change is made to the buffer (and before that a description of the change is placed in the wal buffer). As long as there are buffers available, the data changes are performed on buffers in the buffer cache. However, the buffer cache is a finite amount of buffers, and eventually changed buffers must make it to disk (which happens upon checkpoint, and for other reasons). The important part here is that checkpointer, bgwriter and backends all perform reads and writes. That is all postgres sees: it only knows it's performing a read or write request.

These reads and writes are performed to the operating system, and the operating system essentially applies the same technique for performance as postgres does: if an IO is done buffered (which is the default), the data ends up as a page in the linux page cache. For reads, if the read request happens to find the pages it requests in the linux page cache, the operating system can serve it to the postgres process without needing to perform an actual read from disk. Buffered writes are always done to the linux page cache, and written back by kernel writer threads asynchronously. (this is the essence, there is much more detail). Obviously, how much linux can cache depends on the amount of memory available for caching, and whether a request can be served from cache depends on whether the requested pages are in the linux page cache. For writes, it commonly depends on the setting vm.dirty_ratio, which is a percentage of available memory (commonly misunderstood as a percentage of total memory).

Now, the point I am trying to make: from the postgres point of view, it's impossible to tell whether a read or write request was served from the OS cache by the operating system, or needed physical IO. However, the ability of the operating system to serve from cache essentially depends on the availability of free and file (page cache) memory, and therefore on the general usage of the operating system. Any action on the operating system that needs a significant amount of memory will reduce the amount of available memory and therefore the amount of memory available for caching. This all in all means that if you do something significant on the operating system, it is perfectly possible to perceive that from postgres as a sudden, drastic change in IO latency, whilst from the postgres perspective you didn't do anything different.
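To make that point measurable from the database side: the postgres statistics only distinguish buffers found in shared buffers (blks_hit) from buffers requested from the operating system (blks_read); a blks_read may well have been served from the linux page cache without any physical IO. A minimal sketch against the standard pg_stat_database view:

    -- blks_read counts blocks postgres asked the OS for; it says nothing about
    -- whether the OS did physical IO or served them from its page cache.
    SELECT datname,
           blks_hit,
           blks_read,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS shared_buffers_hit_pct
    FROM pg_stat_database
    WHERE datname = current_database();

Whatever this reports as blks_read still has to be correlated with operating system tools (iostat, sar -d) to see how much of it was actual disk IO.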
Now circling back to your concrete question: if such a thing happens, that linux all of a sudden needs to do physical IOs and lets your request wait on those, instead of doing logical IO, which your session also has to wait on, but for significantly less time, you might all of a sudden see all kinds of IO related waits. You are encountering these with the fast logical IOs too, but they take so little time that you don't see them. This is where wait event occurrence and time accumulation would help significantly; currently we can only sample the wait event state, and thus much of that is never seen, and thus not known. (A minimal sampling query is sketched below the quoted text.)

Frits Hoogland

> On 26 Jun 2025, at 08:47, James Pang wrote:
>
> we faced this issue 3 times this week, each time lasting only 2 seconds, so not easy to run perf in peak business time to capture that, anyway, I will try. before that, I want to understand if "os page cache" or "pg buffer cache" can contribute to the wait_event time "extend" and "DataFileRead", or bgwriter/checkpoint flushing data to disk can impact that too? we enable bgwriter, and we see checkpointer get scheduled by "wal" during the time, so I just increased max_wal_size to make checkpoint scheduled in longer time.
>
> Thanks,
>
> James
>
> Frits Hoogland <[email protected]> 於 2025年6月26日週四 下午2:40寫道:
>> Okay. So it's a situation that is reproducible.
>> And like was mentioned, the system time (percentage) is very high.
>> Is this a physical machine, or a virtual machine?
>>
>> The next thing to do is to use perf to record about 20 seconds or so during a period of time when you see this behavior (perf record -g, taking the backtrace with it).
>> This records (samples) the backtraces of on-cpu tasks, from which you can then derive what they are doing, for which you should see lots of tasks in kernel space, and what that is, using perf report.
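On that last point, a minimal sketch of what sampling looks like today: run the query below repeatedly (for example from psql with \watch 1) while the stall is happening; it uses only standard pg_stat_activity columns.

    -- count the wait events caught at this instant; very short waits
    -- (the fast logical IOs) will mostly be missed by sampling.
    SELECT wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()
    GROUP BY 1, 2
    ORDER BY sessions DESC;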
Re: many sessions waiting DataFileRead and extend
thanks for your explanation. from the Postgresql perspective, is it possible to see bgwriter or checkpointer blocking a backend process reading/writing? or vice versa?

Thanks,

James
Re: many sessions waiting DataFileRead and extend
in addition to "DataFileRead", actually we have more session waiting on "extend", and we enabled log_lock, for example 2025-06-24 18:00:11.368 :[1865315]:[4-1]:mbsLOG: process 1865315 still waiting for ExclusiveLock on extension of relation 14658239 of database 16384 after 1000.161 ms 2025-06-24 18:00:11.368 :mbs@pgaaweb:[1865315]:[5-1]:mbsDETAIL: Process holding the lock: 1748100. Wait queue: 232658, 606235, 1865315, 1939296, 1368901, 2472806, 2072266, 963893, 1304751, 1084094, 875968, 451652, 672314, 1240983, 2085501, 851401, 2043926, 2522611, 889816, 191663, 1090944, 536057, 1810868, 1240510, 1195564, 1067298. see both "DataFileRead" and "extend" , each time last 2 seconds and automatically recoverred. for this "extend" wait_event, in addtion IO, OS page_cache and PG buffer cache, or bgwriter,checkpointer could be impact that too? Thanks, James James Pang 於 2025年6月26日週四 下午2:47寫道: >we faced this issue 3 times this week, each time last only 2 seconds, > so not easy to run perf in peak business time to capture that, anyway, I > will try. before that, I want to understand if "os page cache" or "pg > buffer cache" can contribute to the wait_event time "extend" and > "DataFileRead", or bgwriter ,checkpoint flushing data to disk can impact > that too ? we enable bgwriter , and we see checkpointer get scheduled by > "wal" during the time, so I just increased max_wal_size to make checkpoint > scheduled in longer time. > > Thanks, > > James > > > Frits Hoogland 於 2025年6月26日週四 下午2:40寫道: > >> Okay. So it's a situation that is reproducable. >> And like was mentioned, the system time (percentage) is very high. >> Is this a physical machine, or a virtual machine? >> >> The next thing to do, is use perf to record about 20 seconds or so during >> a period of time when you see this behavior (perf record -g, taking the >> backtrace with it). >> This records (samples) the backtraces of on cpu tasks, from which you >> then can derive what they are doing, for which you should see lots of tasks >> in kernel space, and what that is, using perf report. >> >> *Frits Hoogland* >> >> >> >> >> On 26 Jun 2025, at 04:32, James Pang wrote: >> >> thans for you suggestions, we have iowait from sar command too, copy >> here, checking with infra team not found abnormal IO activities either. 
>> 02:00:01 PM     CPU    %usr   %nice    %sys  %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 02:00:03 PM     all   15.92    0.00   43.02     0.65    0.76    2.56    0.00    0.00    0.00   37.09
>> 02:00:03 PM       0   17.59    0.00   46.73     1.01    0.50    0.50    0.00    0.00    0.00   33.67
>> 02:00:03 PM       1    9.50    0.00   61.50     0.50    0.50    1.00    0.00    0.00    0.00   27.00
>> 02:00:03 PM       2   20.71    0.00   44.44     1.01    0.51    0.51    0.00    0.00    0.00   32.83
>> 02:00:03 PM       3   14.00    0.00   51.50     2.00    1.00    1.00    0.00    0.00    0.00   30.50
>> 02:00:03 PM       4    6.57    0.00   52.53     0.51    0.51    3.54    0.00    0.00    0.00   36.36
>> 02:00:03 PM       5   10.20    0.00   49.49     1.02    1.53    0.00    0.00    0.00    0.00   37.76
>> 02:00:03 PM       6   27.64    0.00   41.21     0.50    0.50    0.50    0.00    0.00    0.00   29.65
>> 02:00:03 PM       7    9.05    0.00   50.75     0.50    1.01    0.50    0.00    0.00    0.00   38.19
>> 02:00:03 PM       8   12.18    0.00   49.75     0.51    0.51    0.51    0.00    0.00    0.00   36.55
>> 02:00:03 PM       9   13.00    0.00    9.50     0.50    1.50   15.50    0.00    0.00    0.00   60.00
>> 02:00:03 PM      10   15.58    0.00   46.23     0.00    0.50    0.50    0.00    0.00    0.00   37.19
>> 02:00:03 PM      11   20.71    0.00   10.10     0.00    1.01   14.14    0.00    0.00    0.00   54.04
>> 02:00:03 PM      12   21.00    0.00   37.00     0.50    1.00    1.00    0.00    0.00    0.00   39.50
>> 02:00:03 PM      13   13.57    0.00   45.73     1.01    1.01    1.01    0.00    0.00    0.00   37.69
>> 02:00:03 PM      14   18.18    0.00   39.39     1.01    0.51    0.51    0.00    0.00    0.00   40.40
>> 02:00:03 PM      15   14.00    0.00   49.50     0.50    0.50    3.50    0.00    0.00    0.00   32.00
>> 02:00:03 PM      16   19.39    0.00   39.80     1.02    1.53    0.51    0.00    0.00    0.00   37.76
>> 02:00:03 PM      17   16.75    0.00   45.18     1.52    1.02    2.54    0.00    0.00    0.00   32.99
>> 02:00:03 PM      18   12.63    0.00   50.00     0.00    1.01    0.00    0.00    0.00    0.00   36.36
>> 02:00:03 PM      19    5.56    0.00   82.32     0.00    0.00    0.00    0.00    0.00    0.00   12.12
>> 02:00:03 PM      20   15.08    0.00   48.24     0.50    0.50    3.52    0.00    0.00    0.00   32.16
>> 02:00:03 PM      21   17.68    0.00    9.09     0.51    1.52   13.64    0.00    0.00    0.00   57.58
>> 02:00:03 PM      22   13.13    0.00   43.94     0.51    0.51    0.51    0.00    0.00    0.00   41.41
>> 02:00:03 PM      23   14.07    0.
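For the "extend" side of this, a sketch of what can be seen from within postgres while it is happening, using the standard pg_locks and pg_stat_activity views (the regclass cast assumes you are connected to the affected database):

    -- relation extension locks appear in pg_locks with locktype = 'extend';
    -- granted = true is the holder, granted = false are the waiters (the wait queue in the log above).
    SELECT l.pid,
           l.relation::regclass AS relation,
           l.granted,
           a.wait_event_type,
           a.wait_event,
           left(a.query, 60) AS query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'extend'
    ORDER BY l.granted DESC, l.pid;

Since the episodes only last about 2 seconds, this only tells you something if it happens to run inside the window, which is the same sampling limitation as with the wait events.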
Re: many sessions waiting DataFileRead and extend
On Thu, Jun 26, 2025 at 2:48 AM James Pang wrote:
> we faced this issue 3 times this week, each time lasting only 2 seconds, so not easy to run perf in peak business time to capture that, anyway, I will try. before that, I want to understand if "os page cache" or "pg buffer cache" can contribute to the wait_event time "extend" and "DataFileRead", or bgwriter/checkpoint flushing data to disk can impact that too? we enable bgwriter, and we see checkpointer get scheduled by "wal" during the time, so I just increased max_wal_size to make checkpoint scheduled in longer time.
>
> Thanks,
>
> James

To be honest, given the additional information that this only lasted 2 seconds, this doesn't sound terribly abnormal to me. My guess would be that a bunch of processes all tried to insert into the same relation at the same time, so PostgreSQL had to make it bigger, and the OS couldn't extend the file quite quickly enough.

I don't want to make it sound like there's absolutely no problem here: PostgreSQL's relation extension logic isn't as performant as we'd like it to be, and there have been several attempts to fix it over the years. I don't know if the last attempt was later than PG14, so you might have more success on a newer version, but I'm also not really sure the limitations of the implementation are your main problem here.

If I were troubleshooting this, I'd try to capture a snapshot of what queries were running at the moment the problem happens. If, for example, all the sessions suddenly decide to insert into the same table at the same time, it might be good to think about whether there's a way to avoid that.

--
Robert Haas
EDB: http://www.enterprisedb.com
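A minimal sketch of such a snapshot, using only the standard pg_stat_activity view; the table name activity_snapshot is made up for illustration, and the INSERT would have to run repeatedly (cron, psql \watch, or similar) so that a few samples land inside the 2-second window:

    -- run once: an empty copy of pg_stat_activity plus a sample timestamp (hypothetical table name)
    CREATE TABLE activity_snapshot AS
    SELECT now() AS sample_time, * FROM pg_stat_activity WHERE false;

    -- run repeatedly: record every active session, its wait event and its query text
    INSERT INTO activity_snapshot
    SELECT now(), * FROM pg_stat_activity WHERE state = 'active';

Looking at which queries and wait events cluster in the seconds around the stall should show whether many sessions were indeed inserting into the same table at the same moment.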
