Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

2025-08-06 Thread Frits Hoogland
Joe, 

Can you name any technical reason why not having swap for a database is 
actually a bad idea?

Memory is always limited. Swap was invented for the situation where the 
(incidental) amount of paged-in memory could (regularly) grow beyond what 
physical memory would allow, the (clear) workaround being swap: it 'cushions' 
the memory shortage by providing a "second level" of memory storage on disk.
Still, this does not make memory unlimited. Swap extends the available 
physical memory by the size of the swap space. You can still run out of 
memory with swap added, simply by paging in more memory than physical memory 
and swap together can hold.

Today, most systems are not memory constrained anymore; or rather: it is 
possible to get a server with enough physical memory to hold your common 
total memory need.
And given the latency-sensitive nature of databases in general, postgres 
included, for any serious deployment you should get a server with enough 
memory to host your workload, and configure postgres not to overload that 
memory.

If you do oversubscribe (physical) memory, you will feel pain somewhere; 
there is no way around that.
The article in defence of swap is in essence saying that if you happen to 
oversubscribe memory, sharing the pain between anonymous and file pages is 
better.
I would say you are already in a bad place if that happens, which is 
especially bad for databases, because databases should allow you to make 
memory usage predictable.

However, what I found is that with 4.x kernels (4.18 to be precise; RHEL 8), 
the kernel can favour file pages in certain situations, so that anonymous 
memory gets paged out even if swappiness is set to 1 or 0 and there is a 
wealth of inactive file memory. It seems to have to do with workingset 
protection(?) mechanisms, but given the lack of clear statistics I can't be 
sure about that. What it leads to in my situations is a constant rate of 
swapping in and out, whilst there is no technical reason for linux to do so, 
because there is enough available memory.
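If you want to check for this pattern on your own systems, one way (just a 
sketch; what counts as "constant" depends on your workload) is to watch the 
swap-in/swap-out rates next to the amount of available memory:

    # si/so columns = pages swapped in/out per second; sustained nonzero
    # values while 'free'/'available' stays large indicate this behaviour
    vmstat 1

    # or the cumulative kernel counters:
    grep -E 'pswpin|pswpout' /proc/vmstat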

My point of view has been that vm.overcommit_memory set to 2 was the way to 
go, because that allows linux to enforce a set limit at allocation time, 
which is a way to make the database never run out of memory.
It does guarantee that linux never runs out of memory, absolutely.
However, this limit is hard, and is applied to a process in both user mode 
and system mode (kernel level), and thus can refuse memory at times when it's 
not safe to do so, and thus corrupt execution. I have to be honest, I have 
not seen this myself, but trustworthy sources have reported it repeatedly, 
and I am inclined to believe them. This means postgres execution can 
corrupt/terminate in unlucky situations, which impacts availability.
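For reference: with vm.overcommit_memory = 2 the kernel computes that hard 
limit as swap + physical RAM * vm.overcommit_ratio / 100 (or from 
vm.overcommit_kbytes), and exposes both the limit and the currently committed 
address space in /proc/meminfo, so you can see how close you are:

    # CommitLimit = swap + RAM * overcommit_ratio/100
    # Committed_AS = address space currently committed; allocations start
    # failing once it reaches CommitLimit
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo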

 
Frits Hoogland




> On 5 Aug 2025, at 20:52, Joe Conway  wrote:
> 
> On 8/5/25 13:01, Priya V wrote:
>> *Environment:*
>>*PostgreSQL Versions:* Mix of 13.13 and 15.12 (upgrade to 15.12 in
>>progress; currently both are actively in use)
> 
> PostgreSQL 13 reaches end of life on November 13, 2025
> 
>>*OS / Kernel:* RHEL 7 & RHEL 8 variants, kernels in the 4.14–4.18 range
> 
> RHEL 7 has been EOL for quite a while now. Note that you have to watch out 
> for collation issues/corrupted indexes after OS upgrades due to collations 
> changing with newer glibc versions.
> 
>>*Swap:* Currently none
> 
> bad idea
> 
>>*Workload:* Highly mixed — OLTP-style internal apps with
>>unpredictable query patterns and connection counts
>>*Goal:* Uniform, safe memory settings across the fleet to avoid
>>kernel or database instability
> 
>> We’re considering:
>>*|vm.overcommit_memory = 2|* for strict accounting
> 
> yes
> 
>>Increasing |vm.overcommit_ratio| from 50 → 80 or 90 to better
>>reflect actual PostgreSQL usage (e.g., |work_mem| reservations that
>>aren’t fully used)
> 
> work_mem does not reserve memory -- it is a maximum that might be used in 
> memory for a particular operation
> 
>> *Our questions for those running large PostgreSQL fleets:*
>> 1.
>>What |overcommit_ratio| do you find safe for PostgreSQL without
>>causing kernel memory crunches?
> 
> Read this:
> https://www.cybertec-postgresql.com/en/what-you-should-know-about-linux-memory-overcommit-in-postgresql/
> 
>> 2.
>>Do you prefer |overcommit_memory = 1| or |= 2| for production stability?
> 
> Use overcommit_memory = 2 for production stability
> 
>> 3.
>>How much swap (if any) do you keep on large-memory servers where
>>PostgreSQL is the primary workload? Is having swap configured a good
>>idea or not?
> 
> You don't necessarily need a large amount of swap, but you definitely should 
> not disable it.
> 
> Some background on that:
> https://chrisdown.name/2018/01/02/in-defence-of-swap.html
> 
>> 4.
>>Any real-world cases where kernel account

Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

2025-08-06 Thread Priya V
Hi Frits, Joe,

Thank you both for your insights.

*Current situation:*

*cat /proc/sys/vm/overcommit_memory*
0

*cat /proc/sys/vm/overcommit_ratio*
50

$ *cat /proc/sys/vm/swappiness*
60

*Workload*: Multi-tenant PostgreSQL

*uname -r*
4.18.0-477.83.1.el8_8.x86_64

*free -h*
              total        used        free      shared  buff/cache   available
Mem:          249Gi       4.3Gi       1.7Gi        22Gi       243Gi       221Gi
Swap:            0B          0B          0B

If we set overcommit_memory = 2, what should we set the overcommit_ratio
value to? Can you please suggest?
Is there a rule of thumb to go with?

*Our goal is to avoid OOM issues, avoid memory wastage, and also not
starve the kernel.*

Thanks!





On Wed, Aug 6, 2025 at 3:47 AM Frits Hoogland wrote:

> Joe,
>
> Can you name any technical reason why not having swap for a database is
> actually a bad idea?
>
> [snip]

Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

2025-08-06 Thread Joe Conway

(Both: please trim and reply inline on these lists as I have done;
 Frits, please reply all not just to the list -- I never received your
 reply to me)

On 8/6/25 11:51, Priya V wrote:

*cat /proc/sys/vm/overcommit_ratio*
50
$ *cat /proc/sys/vm/swappiness*
60

*Workload*: Multi-tenant PostgreSQL

*uname -r*
4.18.0-477.83.1.el8_8.x86_64


IMHO you should strongly consider getting on a more recent distro with a 
newer kernel.



*free -h*
              total        used        free      shared  buff/cache   available
Mem:          249Gi       4.3Gi       1.7Gi        22Gi       243Gi       221Gi
Swap:            0B          0B          0B


As I said, do not disable swap. You don't need a huge amount, but maybe 
16 GB or so would do it.
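If there is no spare partition for it, a swap file along these lines would do 
(a sketch only; adjust the path and size to taste, and note that some 
filesystems want dd rather than fallocate for swap files):

    # create and enable a 16 GiB swap file
    fallocate -l 16G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    # add to /etc/fstab to survive reboots:
    #   /swapfile none swap defaults 0 0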


If we set overcommit_memory = 2, what should we set the 
overcommit_ratio value to? Can you please suggest?

Is there a rule of thumb to go with ?


There is no rule of thumb that I am aware of. Every workload is 
different. Start with something like 80 and do your own testing to 
refine that number.
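As a back-of-the-envelope example (my rounding, and assuming you add the 
16 GiB of swap suggested above): with overcommit_memory = 2 the commit limit 
is swap + RAM * overcommit_ratio / 100, so for your box that would be 
16 GiB + 249 GiB * 0.80 ≈ 215 GiB of commitable address space. That leaves 
roughly 50 GiB of RAM headroom for the kernel and the page cache even if 
everything committed were actually used.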


*Our goal is to avoid OOM issues, avoid memory wastage, and also not 
starve the kernel.*


With overcommit_memory = 2, swap on (and reasonably sized), and 
overcommit_ratio to something reasonable (certainly below 100), I think 
you will have a difficult time getting an OOM kill even if you try 
during testing. But you have to do your own testing for your workloads 
-- there is no magic button here.
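Concretely, applying and persisting those settings looks something like this 
(the file name is just an example, and 80 is only the starting value you 
would refine through testing):

    # runtime
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80
    # persist across reboots
    printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 80\n' \
        > /etc/sysctl.d/90-postgres-memory.conf
    sysctl --system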


That is, unless you run postgres in a cgroup with memory.limit (cgroup 
v1) or memory.max (cgroup v2) set. Note, running in containers with 
memory limits set e.g. via Kubernetes will do that under the covers. 
That is a completely different story.
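If you are not sure whether such a limit applies, you can check the cgroup 
the postmaster actually runs in (the paths below are examples; 
/proc/<pid>/cgroup shows the real one):

    # cgroup v2
    cat /sys/fs/cgroup/memory.max
    # cgroup v1
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes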


On Wed, Aug 6, 2025 at 3:47 AM Frits Hoogland wrote:

Can you name any technical reason why not having swap for a database
is actually a bad idea?


Did you read the blog I linked? Do your own experiments.

* Swap is what is used when anonymous memory must be reclaimed to allow 
for an allocation of anonymous memory.


* The Linux kernel will aggressively use all available memory for file 
buffers, pushing usage against the limits.


* Especially in the older 4.x series kernels, file buffers often cannot be 
reclaimed fast enough.


* With no swap and a large-ish anonymous memory request, it is easy to 
push over the limit to cause the OOM killer to strike.


* On the other hand, with swap enabled anon memory can be reclaimed 
giving the kernel more time to deal with file buffer reclamation.


At least that is what I have observed.

HTH,

--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com




Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

2025-08-06 Thread Frits Hoogland
> As I said, do not disable swap. You don't need a huge amount, but maybe 16 GB 
> or so would do it.

Joe, please, can you state a technical reason for saying this?
All you are saying is ‘don’t do this’. 

I’ve stated my reasons for why this doesn’t make sense, and you don’t give any 
reason.

The article you cite seems to address general usage, not database usage.

Frits

> On 6 Aug 2025, at 18:33, Joe Conway wrote:
> 
> [snip]




Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

2025-08-06 Thread Joe Conway

On 8/6/25 17:14, Frits Hoogland wrote:

As I said, do not disable swap. You don't need a huge amount, but
maybe 16 GB or so would do it.



Joe, please, can you state a technical reason for saying this?
All you are saying is ‘don’t do this’.

I’ve stated my reasons for why this doesn’t make sense, and you don’t give any 
reason.


What do you call the below?


On 6 Aug 2025, at 18:33, Joe Conway wrote:



* Swap is what is used when anonymous memory must be reclaimed to
allow for an allocation of anonymous memory.

* The Linux kernel will aggressively use all available memory for
file buffers, pushing usage against the limits.

* Especially in the older 4.x series kernels, file buffers often
cannot be reclaimed fast enough.

* With no swap and a large-ish anonymous memory request, it is
easy to push over the limit to cause the OOM killer to strike.

* On the other hand, with swap enabled anon memory can be
reclaimed giving the kernel more time to deal with file buffer
reclamation.

At least that is what I have observed.


If you don't think that is an adequate technical reason, feel free to 
ignore my advice.


--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com