Postgresql server gets stuck at low load

2020-06-05 Thread Krzysztof Olszewski
I have a problem with one of my Postgres production servers. The server
works fine almost all the time, but sometimes, without any increase in
transaction or statement volume, the machine gets stuck. Cores go up to
100%, load up to 160%. When it happens there are problems connecting to the
database, and even when a connection succeeds, simple queries take several
seconds instead of milliseconds. The problem sometimes stops after a period
of time (e.g. 35 min); sometimes we must restart Postgres, Linux, or even
KVM (the virtualization host).

My hardware
56 cores (Intel Core Processor (Skylake, IBRS))
400 GB RAM
RAID10 with about 40k IOPS

Os
CentOS Linux release 7.7.1908
kernel 3.10.0-1062.18.1.el7.x86_64

Database size 100 GB (entirely fits in memory :) )
server_version 10.12
effective_cache_size 192000 MB
maintenance_work_mem 2048 MB
max_connections 150
shared_buffers 64000 MB
work_mem 96 MB

In the normal state I have about 500 tps, 5% core usage, and about 3% load;
the whole database fits in memory, so there are no reads from disk, only
writes at about the 500 IOPS level, with occasional spikes to 1500 IOPS,
but on this hardware those values are no problem (no iowait on the cores).
In the normal state this machine does "nothing". Connections to the
database are created by two Java-based app servers through connection
pools, so the connection count is limited by the pool configuration to a
maximum of 120, which is lower than the Postgres limit (150). In the normal
state there are about 20 connections; when the server is stuck it goes to
the maximum (120).

Correlated with the stalls I see messages in the kernel log like:
NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s! [postmaster:33935]
but I don't know whether this is the cause or an effect of the problem.
I investigated with pgBadger and ... nothing strange happens, just normal
statements.

Any ideas?

Thanks,
Kris


Re: Postgresql server gets stuck at low load

2020-06-05 Thread luis . roberto




From: "Krzysztof Olszewski"
To: [email protected]
Sent: Friday, 5 June 2020 7:07:02
Subject: Postgresql server gets stuck at low load






Hi Krzysztof! 

I would enable the pg_stat_statements extension and check whether there are
long-running queries that should be quick.
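
A minimal sketch of that check on PostgreSQL 10 (the timing columns were
renamed in version 13):

-- in postgresql.conf, then restart the server:
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- statements with the highest average runtime:
SELECT calls, mean_time, max_time, total_time,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;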


Re: Postgresql server gets stuck at low load

2020-06-05 Thread Pavel Stehule
On Fri, 5 Jun 2020 at 12:07, Krzysztof Olszewski wrote:

> I have a problem with one of my Postgres production servers. The server
> works fine almost all the time, but sometimes, without any increase in
> transaction or statement volume, the machine gets stuck. Cores go up to
> 100%, load up to 160%. When it happens there are problems connecting to
> the database, and even when a connection succeeds, simple queries take
> several seconds instead of milliseconds. The problem sometimes stops
> after a period of time (e.g. 35 min); sometimes we must restart Postgres,
> Linux, or even KVM (the virtualization host).
>
> My hardware
> 56 cores (Intel Core Processor (Skylake, IBRS))
> 400 GB RAM
> RAID10 with about 40k IOPS
>
> Os
> CentOS Linux release 7.7.1908
> kernel 3.10.0-1062.18.1.el7.x86_64
>
> Database size 100 GB (entirely fits in memory :) )
> server_version 10.12
> effective_cache_size 192000 MB
> maintenance_work_mem 2048 MB
> max_connections 150
> shared_buffers 64000 MB
> work_mem 96 MB
>
> In the normal state I have about 500 tps, 5% core usage, and about 3%
> load; the whole database fits in memory, so there are no reads from disk,
> only writes at about the 500 IOPS level, with occasional spikes to 1500
> IOPS, but on this hardware those values are no problem (no iowait on the
> cores). In the normal state this machine does "nothing". Connections to
> the database are created by two Java-based app servers through connection
> pools, so the connection count is limited by the pool configuration to a
> maximum of 120, which is lower than the Postgres limit (150). In the
> normal state there are about 20 connections; when the server is stuck it
> goes to the maximum (120).
>
> Correlated with the stalls I see messages in the kernel log like:
> NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s! [postmaster:33935]
> but I don't know whether this is the cause or an effect of the problem.
> I investigated with pgBadger and ... nothing strange happens, just
> normal statements.
>
> Any ideas?
>

You can try installing perf and the debug symbols for Postgres. When you
hit this problem again, run "perf top". You will see which routines are
eating your CPU.
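
Roughly, assuming CentOS 7 with Postgres installed from the PGDG packages
(the exact package names depend on how Postgres was installed):

# as root: install perf and the Postgres debug symbols
yum install -y perf yum-utils
debuginfo-install -y postgresql10-server
# while the problem is happening, watch which functions burn CPU:
perf top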

Maybe it is a spinlock problem:

https://www.postgresql.org/message-id/CAHyXU0yAsVxoab2PcyoCuPjqymtnaE93v7bN4ctv2aNi92fefA%40mail.gmail.com

Your answers to Merlin's questions from that mail could also be interesting:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
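
Note that on CentOS 7 the paths are usually
/sys/kernel/mm/transparent_hugepage/enabled and .../defrag rather than the
redhat_ variants. If THP does turn out to be implicated, a common
mitigation (as root) is:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag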

Regards

Pavel


>
> Thanks,
> Kris
>
>
>


Re: When to use PARTITION BY HASH?

2020-06-05 Thread Imre Samu
> "Bulk loads ...",

As I see it, there is an interesting bulk-load benchmark:

"How Bulkload performance is affected by table partitioning in PostgreSQL"
by Beena Emerson (EnterpriseDB, December 4, 2019)





SUMMARY: This article covers how benchmark tests can be used to
demonstrate the effect of table partitioning on performance. Tests using
range- and hash-partitioned tables are compared and the reasons for their
different results are explained: 1. Range partitions, 2. Hash partitions,
3. Combination graphs, 4. Explaining the behavior, 5. Conclusion.

"For the hash-partitioned table, the first value is inserted in the first
partition, the second number in the second partition and so on till all the
partitions are reached before it loops back to the first partition again
until all the data is exhausted. Thus it exhibits the worst-case scenario
where the partition is repeatedly switched for every value inserted. As a
result, the number of times the partition is switched in a
range-partitioned table is equal to the number of partitions, while in a
hash-partitioned table, the number of times the partition has switched is
equal to the amount of data being inserted. This causes the massive
difference in timing for the two partition types."

https://www.enterprisedb.com/postgres-tutorials/how-bulkload-performance-affected-table-partitioning-postgresql
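
For reference, a minimal hash-partitioned table looks like this (an
illustrative sketch, names invented):

CREATE TABLE measurements (
    id      bigint NOT NULL,
    payload text
) PARTITION BY HASH (id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE measurements_p2 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE measurements_p3 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- consecutive ids hash to different partitions, which is why a
-- sequential bulk load switches the target partition on almost every row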

Regards,
 Imre


On Tue, 2 Jun 2020 at 19:17, Oleksandr Shulgin wrote:

> Hi!
>
> I was reading up on declarative partitioning[1] and I'm not sure what
> could be a possible application of Hash partitioning.
>
> Is anyone actually using it?  What are typical use cases?  What benefits
> does such a partitioning scheme provide?
>
> On its face, it seems that it can only give you a number of tables which
> are smaller than the un-partitioned one, but I fail to see how it would
> provide any of the potential advantages listed in the documentation.
>
> With a reasonable hash function, the distribution of rows across
> partitions should be more or less equal, so I wouldn't expect any of the
> following to hold true:
> - "...most of the heavily accessed rows of the table are in a single
> partition or a small number of partitions."
> - "Bulk loads and deletes can be accomplished by adding or removing
> partitions...",
> etc.
>
> That *might* turn out to be the case with a small number of distinct
> values in the partitioning column(s), but then why rely on hash
> assignment instead of using PARTITION BY LIST in the first place?
>
> Regards,
> --
> Alex
>
> [1] https://www.postgresql.org/docs/12/ddl-partitioning.html
>
>


Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread Cedric Leong
Somewhat unrelated, but a note to anyone who wants to swap out partition
keys: don't create a clone of the table with the new partition key and
insert the data. It messes up the query planner massively and makes
everything much slower.

On Mon, May 25, 2020 at 12:48 AM Tom Lane  wrote:

> Cedric Leong  writes:
> > Just in case someone is interested enough to answer this. Does anyone
> know
> > if the performance for a date column vs a timestamp column as a partition
> > key is large?
>
> I doubt it's even measurable, at least on 64-bit machines.  You're
> basically talking about 32-bit integer comparisons vs 64-bit integer
> comparisons.
>
> On a 32-bit machine it's possible that an index on a date column
> will be physically smaller, so you could get some wins from reduced
> I/O.  But on (most?) 64-bit machines that difference goes away too,
> because of alignment restrictions.
>
> As always, YMMV; it never hurts to do your own testing.
>
> regards, tom lane
>
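
The size difference Tom mentions is easy to check for yourself:

SELECT pg_column_size('2020-06-05'::date)      AS date_bytes,      -- 4
       pg_column_size('2020-06-05'::timestamp) AS timestamp_bytes; -- 8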


Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread David Rowley
On Sat, 6 Jun 2020 at 14:12, Cedric Leong  wrote:
> Somewhat unrelated, but a note to anyone who wants to swap out partition
> keys: don't create a clone of the table with the new partition key and
> insert the data. It messes up the query planner massively and makes
> everything much slower.

That complaint would have more meaning if you'd mentioned which
version of PostgreSQL you're using. The performance of partitioning in
PostgreSQL has changed significantly over the past 3 releases. It would
also be useful to know what you've actually done (the actual commands).
I can't imagine it makes *everything* slower, so it might be good to
mention what is actually slower.

David




Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread Cedric Leong
It's less a complaint than a warning not to do what I did.

Version:
PostgreSQL 11.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.9.3, 64-bit

The actual command list would probably be impractical to include here,
because the majority of it would just be creating a large number of
partition tables. But in summary, what I've done is basically this (roughly
the sequence sketched after the list):
The existing database has a partitioned fact table.
1. Create an exact clone of that partitioned fact table, including all the
same indexes, columns, and partitions.
2. Change the partitioned table's partition key from an indexed date column
to an indexed timestamp without time zone column.
3. Copy every row from the old partitioned fact table into the new one with
INSERT INTO ... SELECT (an insert, because I wanted the timestamp without
time zone column to be in a new time zone).
4. Swap the names of the tables so the new one is the one that's used.
5. VACUUM FULL; ANALYZE;
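
A rough sketch of those steps (all table and column names invented, only
one partition shown):

-- the existing table, keyed by a date column:
CREATE TABLE fact (
    event_date date NOT NULL,
    amount     numeric
) PARTITION BY RANGE (event_date);
CREATE TABLE fact_2020_06 PARTITION OF fact
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');

-- steps 1 and 2: the clone, keyed by a timestamp column instead:
CREATE TABLE fact_new (
    created_at timestamp without time zone NOT NULL,
    amount     numeric
) PARTITION BY RANGE (created_at);
CREATE TABLE fact_new_2020_06 PARTITION OF fact_new
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');
CREATE INDEX ON fact_new (created_at);

-- step 3: copy the rows, shifting them into the new time zone:
INSERT INTO fact_new
SELECT event_date::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York',
       amount
FROM fact;

-- step 4: swap the names:
ALTER TABLE fact RENAME TO fact_old;
ALTER TABLE fact_new RENAME TO fact;

-- step 5:
VACUUM FULL; ANALYZE;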

For my use case, a data warehouse star schema, this fact table is
basically the base table of every report. To be more specific, the reports
I've tested range from 2x slower to 4x slower. From what I can see so far,
that's because the query plans are drastically different. An example of
this test looks like this: https://explain.depesz.com/s/6rP8 and
https://explain.depesz.com/s/cLUY
These tests run the exact same query on two different tables, with the
exception that each uses its respective partition key.


On Fri, Jun 5, 2020 at 10:17 PM David Rowley  wrote:

> On Sat, 6 Jun 2020 at 14:12, Cedric Leong  wrote:
> > Somewhat unrelated, but a note to anyone who wants to swap out
> > partition keys: don't create a clone of the table with the new
> > partition key and insert the data. It messes up the query planner
> > massively and makes everything much slower.
>
> That complaint would have more meaning if you'd mentioned which
> version of PostgreSQL you're using. The performance of partitioning in
> PostgreSQL has changed significantly over the past 3 releases. It would
> also be useful to know what you've actually done (the actual commands).
> I can't imagine it makes *everything* slower, so it might be good to
> mention what is actually slower.
>
> David
>


Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread David Rowley
On Sat, 6 Jun 2020 at 14:49, Cedric Leong  wrote:
> It's less a complaint than a warning not to do what I did.

My point was really that nobody knew what you did or what you did it on,
so it didn't seem like a worthwhile warning as it completely lacked
detail.

> These tests run the exact same query on two different tables, with the
> exception that each uses its respective partition key.

Are you sure?  It looks like the old one does WHERE date =
((now())::date - '7 days'::interval) and the new version does
date(created_at) = ((now())::date - '7 days'::interval). I guess you
renamed "date" to "created_at" and changed the query to use date(). If
that expression is not indexed, then I imagine that would be a good
reason for the planner to have moved away from using the index on that
column. Also, date(created_at) will not allow run-time pruning to work,
since your partition key is "created_at".
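
If rewriting the query is not an option, the expression itself could be
indexed instead, e.g. (a hypothetical sketch, table name invented):

CREATE INDEX ON fact ((date(created_at)));

though that only helps the index scan; it would still leave run-time
partition pruning unable to kick in.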

You might be able to change the query to query a range of values on the
new timestamp column. That would let you get rid of the date()
function. For example:

where created_at >= date_trunc('day', now() - '7 days'::interval) and
created_at < date_trunc('day', now() - '6 days'::interval)

David




Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread Cedric Leong
I can confirm that was the issue. After removing the expression and using
only what was indexed, the query plan is definitely fixed. I appreciate
all the help you've given me. I didn't really think to look there, but it
makes a ton of sense that a filter on the database only works well if it's
indexed.

Thanks again,

On Fri, Jun 5, 2020 at 11:13 PM David Rowley  wrote:

> On Sat, 6 Jun 2020 at 14:49, Cedric Leong  wrote:
> > It's less a complaint than a warning not to do what I did.
>
> My point was really that nobody knew what you did or what you did it on,
> so it didn't seem like a worthwhile warning as it completely lacked
> detail.
>
> > These tests run the exact same query on two different tables, with
> > the exception that each uses its respective partition key.
>
> Are you sure?  It looks like the old one does WHERE date =
> ((now())::date - '7 days'::interval) and the new version does
> date(created_at) = ((now())::date - '7 days'::interval). I guess you
> renamed "date" to "created_at" and changed the query to use date(). If
> that expression is not indexed, then I imagine that would be a good
> reason for the planner to have moved away from using the index on that
> column. Also, date(created_at) will not allow run-time pruning to work,
> since your partition key is "created_at".
>
> You might be able to change the query to query a range of values on the
> new timestamp column. That would let you get rid of the date()
> function. For example:
>
> where created_at >= date_trunc('day', now() - '7 days'::interval) and
> created_at < date_trunc('day', now() - '6 days'::interval)
>
> David
>