Re: many sessions waiting DataFileRead and extend
Postgres lives as a process in linux, and keeps its own cache, and tries to use that as much as possible for data. This is postgres shared buffers, commonly called the buffer cache.

For WAL, sessions write to the wal buffer (separate from the postgres buffer cache), and need to write it to disk upon commit with default settings. Please mind wal writes are commonly performed by the wal writer, but can be done by backends too.

(slightly simplified) When postgres needs to read data for processing, it will look for it in the postgres buffer cache and use it from the cache if it can find the requested buffer; otherwise it will have to read it from the operating system. When changes are made to data or metadata (such as hint bits being set), the change is made to the buffer (and before that a description of the change is placed in the wal buffer). As long as there are buffers available, the data changes are performed on buffers in the buffer cache. However, the buffer cache is a finite amount of buffers, and eventually changed buffers must make it to disk (which happens upon checkpoint, and for other reasons). The important part here is that checkpointer, bgwriter and backends all perform reads and writes. That is all postgres sees: it only knows it's performing a read or write request.

These reads and writes are performed to the operating system, and the operating system essentially applies the same technique for performance as postgres does: if an IO is done buffered (which is the default), the data ends up as a page in the linux page cache. For reads, if the read request happens to find the pages it requests in the linux page cache, the operating system can serve it to the postgres process without needing to perform an actual read from disk. Buffered writes are always done to the linux page cache, and written back by kernel writer threads asynchronously. (this is the essence, there is much more detail). Obviously, how much linux can cache depends on the amount of memory available for caching, and whether a request can be served from cache depends on whether the requested pages are in the linux page cache. For writes, it commonly depends on the setting vm.dirty_ratio, which is a percentage of available memory (commonly misunderstood as a percentage of total memory).

Now, the point I am trying to make: from the postgres point of view, it's impossible to tell whether a read or write request was served from the OS cache by the operating system, or needed physical IO. However, the ability of the operating system to serve from cache essentially depends on the availability of free and file (page cache) memory, and therefore on the general usage of the operating system. Any action on the operating system that needs a significant amount of memory will reduce the amount of available memory and therefore the amount of memory available for caching. This all in all means that if you do something significant on the operating system, it is perfectly possible to perceive that from postgres as a sudden, drastic change in IO latency, whilst from the postgres perspective you didn't do anything different.
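To make that point measurable from the database side: the postgres statistics only distinguish buffers found in shared buffers (blks_hit) from buffers requested from the operating system (blks_read); a blks_read may well have been served from the linux page cache without any physical IO. A minimal sketch against the standard pg_stat_database view:

    -- blks_read counts blocks postgres asked the OS for; it says nothing about
    -- whether the OS did physical IO or served them from its page cache.
    SELECT datname,
           blks_hit,
           blks_read,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS shared_buffers_hit_pct
    FROM pg_stat_database
    WHERE datname = current_database();

Whatever this reports as blks_read still has to be correlated with operating system tools (iostat, sar -d) to see how much of it was actual disk IO.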
Now circling back to your concrete question: if such a thing happens, that linux all of a sudden needs to do physical IOs and lets your request wait on those, instead of doing logical IO, which your session also has to wait on, but for significantly less time, you might all of a sudden see all kinds of IO related waits. You are encountering these with the fast logical IOs too, but they take so little time that you don't see them. This is where wait event occurrence and time accumulation would help significantly; currently we can only sample the wait event state, and thus much of that is never seen, and thus not known. (A minimal sampling query is sketched below the quoted text.)

Frits Hoogland

> On 26 Jun 2025, at 08:47, James Pang wrote:
>
> we faced this issue 3 times this week, each time lasting only 2 seconds, so not easy to run perf in peak business time to capture that, anyway, I will try. before that, I want to understand if "os page cache" or "pg buffer cache" can contribute to the wait_event time "extend" and "DataFileRead", or bgwriter/checkpoint flushing data to disk can impact that too? we enable bgwriter, and we see checkpointer get scheduled by "wal" during the time, so I just increased max_wal_size to make checkpoint scheduled in longer time.
>
> Thanks,
>
> James
>
> Frits Hoogland <[email protected]> 於 2025年6月26日週四 下午2:40寫道:
>> Okay. So it's a situation that is reproducible.
>> And like was mentioned, the system time (percentage) is very high.
>> Is this a physical machine, or a virtual machine?
>>
>> The next thing to do is to use perf to record about 20 seconds or so during a period of time when you see this behavior (perf record -g, taking the backtrace with it).
>> This records (samples) the backtraces of on-cpu tasks, from which you can then derive what they are doing, for which you should see lots of tasks in kernel space, and what that is, using perf report.
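On that last point, a minimal sketch of what sampling looks like today: run the query below repeatedly (for example from psql with \watch 1) while the stall is happening; it uses only standard pg_stat_activity columns.

    -- count the wait events caught at this instant; very short waits
    -- (the fast logical IOs) will mostly be missed by sampling.
    SELECT wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()
    GROUP BY 1, 2
    ORDER BY sessions DESC;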
Re: many sessions waiting DataFileRead and extend
thanks for your explanation. from the Postgresql perspective, is it possible to see bgwriter or checkpointer blocking a backend process reading/writing? or vice versa?

Thanks,

James
Re: many sessions waiting DataFileRead and extend
in addition to "DataFileRead", actually we have more session waiting on "extend", and we enabled log_lock, for example 2025-06-24 18:00:11.368 :[1865315]:[4-1]:mbsLOG: process 1865315 still waiting for ExclusiveLock on extension of relation 14658239 of database 16384 after 1000.161 ms 2025-06-24 18:00:11.368 :mbs@pgaaweb:[1865315]:[5-1]:mbsDETAIL: Process holding the lock: 1748100. Wait queue: 232658, 606235, 1865315, 1939296, 1368901, 2472806, 2072266, 963893, 1304751, 1084094, 875968, 451652, 672314, 1240983, 2085501, 851401, 2043926, 2522611, 889816, 191663, 1090944, 536057, 1810868, 1240510, 1195564, 1067298. see both "DataFileRead" and "extend" , each time last 2 seconds and automatically recoverred. for this "extend" wait_event, in addtion IO, OS page_cache and PG buffer cache, or bgwriter,checkpointer could be impact that too? Thanks, James James Pang 於 2025年6月26日週四 下午2:47寫道: >we faced this issue 3 times this week, each time last only 2 seconds, > so not easy to run perf in peak business time to capture that, anyway, I > will try. before that, I want to understand if "os page cache" or "pg > buffer cache" can contribute to the wait_event time "extend" and > "DataFileRead", or bgwriter ,checkpoint flushing data to disk can impact > that too ? we enable bgwriter , and we see checkpointer get scheduled by > "wal" during the time, so I just increased max_wal_size to make checkpoint > scheduled in longer time. > > Thanks, > > James > > > Frits Hoogland 於 2025年6月26日週四 下午2:40寫道: > >> Okay. So it's a situation that is reproducable. >> And like was mentioned, the system time (percentage) is very high. >> Is this a physical machine, or a virtual machine? >> >> The next thing to do, is use perf to record about 20 seconds or so during >> a period of time when you see this behavior (perf record -g, taking the >> backtrace with it). >> This records (samples) the backtraces of on cpu tasks, from which you >> then can derive what they are doing, for which you should see lots of tasks >> in kernel space, and what that is, using perf report. >> >> *Frits Hoogland* >> >> >> >> >> On 26 Jun 2025, at 04:32, James Pang wrote: >> >> thans for you suggestions, we have iowait from sar command too, copy >> here, checking with infra team not found abnormal IO activities either. 
>> 02:00:01 PM     CPU    %usr   %nice    %sys  %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 02:00:03 PM     all   15.92    0.00   43.02     0.65    0.76    2.56    0.00    0.00    0.00   37.09
>> 02:00:03 PM       0   17.59    0.00   46.73     1.01    0.50    0.50    0.00    0.00    0.00   33.67
>> 02:00:03 PM       1    9.50    0.00   61.50     0.50    0.50    1.00    0.00    0.00    0.00   27.00
>> 02:00:03 PM       2   20.71    0.00   44.44     1.01    0.51    0.51    0.00    0.00    0.00   32.83
>> 02:00:03 PM       3   14.00    0.00   51.50     2.00    1.00    1.00    0.00    0.00    0.00   30.50
>> 02:00:03 PM       4    6.57    0.00   52.53     0.51    0.51    3.54    0.00    0.00    0.00   36.36
>> 02:00:03 PM       5   10.20    0.00   49.49     1.02    1.53    0.00    0.00    0.00    0.00   37.76
>> 02:00:03 PM       6   27.64    0.00   41.21     0.50    0.50    0.50    0.00    0.00    0.00   29.65
>> 02:00:03 PM       7    9.05    0.00   50.75     0.50    1.01    0.50    0.00    0.00    0.00   38.19
>> 02:00:03 PM       8   12.18    0.00   49.75     0.51    0.51    0.51    0.00    0.00    0.00   36.55
>> 02:00:03 PM       9   13.00    0.00    9.50     0.50    1.50   15.50    0.00    0.00    0.00   60.00
>> 02:00:03 PM      10   15.58    0.00   46.23     0.00    0.50    0.50    0.00    0.00    0.00   37.19
>> 02:00:03 PM      11   20.71    0.00   10.10     0.00    1.01   14.14    0.00    0.00    0.00   54.04
>> 02:00:03 PM      12   21.00    0.00   37.00     0.50    1.00    1.00    0.00    0.00    0.00   39.50
>> 02:00:03 PM      13   13.57    0.00   45.73     1.01    1.01    1.01    0.00    0.00    0.00   37.69
>> 02:00:03 PM      14   18.18    0.00   39.39     1.01    0.51    0.51    0.00    0.00    0.00   40.40
>> 02:00:03 PM      15   14.00    0.00   49.50     0.50    0.50    3.50    0.00    0.00    0.00   32.00
>> 02:00:03 PM      16   19.39    0.00   39.80     1.02    1.53    0.51    0.00    0.00    0.00   37.76
>> 02:00:03 PM      17   16.75    0.00   45.18     1.52    1.02    2.54    0.00    0.00    0.00   32.99
>> 02:00:03 PM      18   12.63    0.00   50.00     0.00    1.01    0.00    0.00    0.00    0.00   36.36
>> 02:00:03 PM      19    5.56    0.00   82.32     0.00    0.00    0.00    0.00    0.00    0.00   12.12
>> 02:00:03 PM      20   15.08    0.00   48.24     0.50    0.50    3.52    0.00    0.00    0.00   32.16
>> 02:00:03 PM      21   17.68    0.00    9.09     0.51    1.52   13.64    0.00    0.00    0.00   57.58
>> 02:00:03 PM      22   13.13    0.00   43.94     0.51    0.51    0.51    0.00    0.00    0.00   41.41
>> 02:00:03 PM      23   14.07    0.
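For the "extend" side of this, a sketch of what can be seen from within postgres while it is happening, using the standard pg_locks and pg_stat_activity views (the regclass cast assumes you are connected to the affected database):

    -- relation extension locks appear in pg_locks with locktype = 'extend';
    -- granted = true is the holder, granted = false are the waiters (the wait queue in the log above).
    SELECT l.pid,
           l.relation::regclass AS relation,
           l.granted,
           a.wait_event_type,
           a.wait_event,
           left(a.query, 60) AS query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'extend'
    ORDER BY l.granted DESC, l.pid;

Since the episodes only last about 2 seconds, this only tells you something if it happens to run inside the window, which is the same sampling limitation as with the wait events.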
Re: many sessions waiting DataFileRead and extend
On Thu, Jun 26, 2025 at 2:48 AM James Pang wrote:
> we faced this issue 3 times this week, each time lasting only 2 seconds, so not easy to run perf in peak business time to capture that, anyway, I will try. before that, I want to understand if "os page cache" or "pg buffer cache" can contribute to the wait_event time "extend" and "DataFileRead", or bgwriter/checkpoint flushing data to disk can impact that too? we enable bgwriter, and we see checkpointer get scheduled by "wal" during the time, so I just increased max_wal_size to make checkpoint scheduled in longer time.
>
> Thanks,
>
> James

To be honest, given the additional information that this only lasted 2 seconds, this doesn't sound terribly abnormal to me. My guess would be that a bunch of processes all tried to insert into the same relation at the same time, so PostgreSQL had to make it bigger, and the OS couldn't extend the file quite quickly enough.

I don't want to make it sound like there's absolutely no problem here: PostgreSQL's relation extension logic isn't as performant as we'd like it to be, and there have been several attempts to fix it over the years. I don't know if the last attempt was later than PG14, so you might have more success on a newer version, but I'm also not really sure the limitations of the implementation are your main problem here.

If I were troubleshooting this, I'd try to capture a snapshot of what queries were running at the moment the problem happens. If, for example, all the sessions suddenly decide to insert into the same table at the same time, it might be good to think about whether there's a way to avoid that.

--
Robert Haas
EDB: http://www.enterprisedb.com
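A minimal sketch of such a snapshot, using only the standard pg_stat_activity view; the table name activity_snapshot is made up for illustration, and the INSERT would have to run repeatedly (cron, psql \watch, or similar) so that a few samples land inside the 2-second window:

    -- run once: an empty copy of pg_stat_activity plus a sample timestamp (hypothetical table name)
    CREATE TABLE activity_snapshot AS
    SELECT now() AS sample_time, * FROM pg_stat_activity WHERE false;

    -- run repeatedly: record every active session, its wait event and its query text
    INSERT INTO activity_snapshot
    SELECT now(), * FROM pg_stat_activity WHERE state = 'active';

Looking at which queries and wait events cluster in the seconds around the stall should show whether many sessions were indeed inserting into the same table at the same moment.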
