Re: Database Stalls

2023-02-03 Thread Craig Jackson
Consider creating a pg_stat_activity history table. This would allow you to
look back at the time of the incident and check whether any unusual activity
was occurring in the database. Something like:

CREATE TABLE pg_stat_activity_hist AS
SELECT
    now() AS sample_time,
    a.*
FROM
    pg_stat_activity a
WITH NO DATA;


Then, with a cron job or a Postgres job scheduler, insert into the
pg_stat_activity history table at some desired interval (e.g. 30s, 1m, or 5m):

INSERT INTO pg_stat_activity_hist
SELECT
    now(),
    a.*
FROM
    pg_stat_activity a
WHERE
    state IN ('active', 'idle in transaction');

Then regularly purge any samples with a sample_time older than your desired
retention period (1 day, 1 week, 1 month).
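
For example, a minimal sketch of the purge, assuming a 7-day retention (the
interval is only an illustration):

DELETE FROM pg_stat_activity_hist
WHERE sample_time < now() - interval '7 days';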

It is not a perfect solution because the problem (if it is a db problem)
could occur between your pg_stat_activity samples. We keep this kind of
history and it is very helpful when trying to find a post-event root cause.
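
As a sketch of how the history gets used afterwards (the time window below is
only illustrative), you can pull everything that was running around the stall:

SELECT sample_time, pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity_hist
WHERE sample_time BETWEEN '2023-01-30 10:40' AND '2023-01-30 10:50'
ORDER BY sample_time, pid;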

Craig



On Jan 30, 2023 at 10:47:49 AM, Mok  wrote:

> Hi,
>
> We've started to observe instances of one of our databases stalling for a
> few seconds.
>
> We see a spike in WAL write locks, then nothing for a few seconds, after
> which we see a latency spike as processes waiting to get to the db can do
> so.
>
> There is nothing in the postgres logs that gives us any clues as to what
> could be happening: no locks, no unusually high/long-running transactions,
> just a pause and resume.
>
> Could anyone give me any advice as to what to look for when it comes to
> checking the underlying disk that the db is on?
>
> Thanks,
>
> Gurmokh
>
>
>



Postgres backup tool recommendations for multi-terabyte database in Google Cloud

2019-12-05 Thread Craig Jackson
Hi,

We are in the process of migrating an Oracle database to Postgres in Google
Cloud and are investigating backup/recovery tools. The database size is
> 20TB. We have an SLA that requires us to be able to complete a full
restore of the database within 24 hours. We have been testing
pgbackrest, barman, and GCP snapshots but wanted to see if there are any
other recommendations we should consider.

*Desirable features*
- Parallel backup/recovery
- Incremental backups
- Backup directly to a GCP bucket
- Deduplication/Compression

Any suggestions would be appreciated.

Craig Jackson


Re: Postgres backup tool recommendations for multi-terabyte database in Google Cloud

2019-12-05 Thread Craig Jackson
Thanks, I'll check it out.

On Thu, Dec 5, 2019 at 12:51 PM Craig James  wrote:

> On Thu, Dec 5, 2019 at 9:48 AM Craig Jackson 
> wrote:
>
>> Hi,
>>
>> We are in the process of migrating an Oracle database to Postgres in
>> Google Cloud and are investigating backup/recovery tools. The database
>> size is > 20TB. We have an SLA that requires us to be able to complete a
>> full restore of the database within 24 hours. We have been testing
>> pgbackrest, barman, and GCP snapshots but wanted to see if there are any
>> other recommendations we should consider.
>>
>> *Desirable features*
>> - Parallel backup/recovery
>> - Incremental backups
>> - Backup directly to a GCP bucket
>> - Deduplication/Compression
>>
>
> For your 24-hour-restore requirement, there's an additional feature you
> might consider: incremental restore, or what you might call "recovery in
> place"; that is, the ability to keep a more-or-less up-to-date copy, and
> then in an emergency only restore the diffs on the file system. pgbackrest
> uses a built-in rsync-like feature, plus a client-server architecture, that
> allows it to quickly determine which disk blocks need to be updated.
> Checksums are computed on each side, and data are only transferred if
> checksums differ. It's very efficient. I assume that a 20 TB database is
> mostly static, with only a small fraction of the data updated in any month.
> I believe the checksums are precomputed and stored in the pgbackrest
> repository, so you can even do this from an Amazon S3 (or whatever Google's
> Cloud equivalent is for low-cost storage) backup with just modest bandwidth
> usage.
>
> In a cloud environment, you can do this on modestly-priced hardware (a few
> CPUs, modest memory). In the event of a failover, unmount your backup disk,
> spin up a big server, mount the database, do the incremental restore, and
> you're in business.
>
> Craig (James)
>
>
>> Any suggestions would be appreciated.
>>
>> Craig Jackson
>>
>
>
>

-- 
Craig


Re: good book or any other resources for Postgresql

2020-05-04 Thread Craig Jackson
We are currently engaged in an Oracle to Postgres migration. Our DBA team
has been going through this book and we have learned a lot from it.

PostgreSQL 12 High Availability Cookbook - Third Edition
https://www.packtpub.com/data/postgresql-12-high-availability-cookbook-third-edition

On Mon, May 4, 2020 at 5:42 PM Michael Lewis  wrote:

> I don't know the others, but have enjoyed and learned a great deal from
> The Art of PostgreSQL.
>
>>

-- 
Craig


Re: High COMMIT times

2021-01-07 Thread Craig Jackson
We had a similar situation recently and saw high commit times that were
caused by unindexed foreign key columns when deleting data from large
tables. You might check whether any new foreign key constraints have been
added recently or whether any foreign key indexes have inadvertently been
removed. Indexing the foreign keys resolved our issue.
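
As a sketch (the table and column names here are made up for illustration,
not taken from the thread), an index on the referencing column is what keeps
a delete on the parent table from scanning the child table for every deleted
row:

CREATE INDEX CONCURRENTLY child_parent_id_idx
    ON child_table (parent_id);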

Regards,

Craig

On Wed, Jan 6, 2021 at 9:19 AM Don Seiler  wrote:

> Good morning,
>
> This week we've noticed that we're starting to see spikes where COMMITs
> are taking much longer than usual. Sometimes, quite a few seconds to
> finish. After a few minutes they disappear but then return seemingly at
> random. This becomes visible to the app and end user as a big stall in
> activity.
>
> The checkpoints are still running for their full 5 min checkpoint_timeout
> duration (logs all say "checkpoint starting: time") and I'm not seeing any
> warnings about them occurring too frequently.
>
> This is PostgreSQL 12.4 on Ubuntu 18.04, all running in MS Azure (*not*
> managed by them).
>
> # select version();
>  version
>
> -
>  PostgreSQL 12.4 (Ubuntu 12.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu,
> compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
>
> I have the stats_temp_directory in a tmpfs mount. I *do* have pg_wal on
> the same premium SSD storage volume as the data directory. Normally I would
> know to separate these but I was told with the cloud storage that it's all
> virtualized anyway, plus storage IOPS are determined by disk size so having
> a smaller volume just for pg_wal would hurt me in this case. The kind folks
> in the PG community Slack suggested just having one large premium cloud
> storage mount for the data directory and leave pg_wal inside because this
> virtualization removes any guarantee of true separation.
>
> I'm wondering if others have experience running self-managed PG in a cloud
> setting (especially if in MS Azure) and what they might have seen/done in
> cases like this.
>
> Thanks,
> Don.
>
> --
> Don Seiler
> www.seiler.us
>


-- 
Craig



Re: High COMMIT times

2021-01-08 Thread Craig Jackson
Yes, these were deferrable foreign key constraints.
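
For anyone wanting to check their own schema, a quick sketch that lists the
deferrable foreign key constraints:

SELECT conrelid::regclass AS table_name, conname
FROM pg_constraint
WHERE contype = 'f' AND condeferrable;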

On Fri, Jan 8, 2021 at 2:05 AM Laurenz Albe 
wrote:

> On Thu, 2021-01-07 at 10:49 -0700, Craig Jackson wrote:
> > We had a similar situation recently and saw high commit times that were
> > caused by having unindexed foreign key columns when deleting data with
> > large tables involved. You might check to see if any new foreign key
> > constraints have been added recently or if any foreign key indexes may
> > have inadvertently been removed. Indexing the foreign keys resolved our
> > issue.
>
> Were these deferred foreign key constraints?
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>
>

-- 
Craig



Re: High COMMIT times

2021-01-11 Thread Craig Jackson
How far apart are the min/max connection settings on your application
connection pool? We had a similar issue with connection storms on Oracle in
the past. One thing we did to minimize the storms was to make sure there was
not a wide gap between the min and max, say no more than a 5-10 connection
difference.
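
To see how far the pool actually swings during one of these storms, a simple
sketch (nothing here is specific to your setup) is to sample connection
counts by source:

SELECT usename, application_name, state, count(*)
FROM pg_stat_activity
GROUP BY usename, application_name, state
ORDER BY count(*) DESC;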

Regards,

Craig

On Sun, Jan 10, 2021 at 5:42 PM Don Seiler  wrote:

> On Sat, Jan 9, 2021 at 2:07 PM Jeff Janes  wrote:
>
>>
>> How are you monitoring the COMMIT times?  What do you generally see in
>> pg_stat_activity.wait_event during the spikes/stalls?
>>
>
> Right now we just observe the COMMIT duration posted in the postgresql log
> (we log anything over 100ms).
>
> One other thing that I shamefully forgot to mention. When we see these
> slow COMMITs in the log, they coincide with a connection storm (Cat 5
> hurricane) from our apps where connections will go from ~200 to ~1200. This
> will probably disgust many, but our PG server's max_connections is set to
> 2000. We have a set of pgbouncers in front of this with a total
> max_db_connections of 1600. I know many of you think this defeats the whole
> purpose of having pgbouncer and I agree. I've been trying to explain as
> much and that even with 32 CPUs on this DB host, we probably shouldn't
> expect to be able to support more than 100-200 active connections, let
> alone 1600. I'm still pushing to have our app server instances (which also
> use their own JDBC (Hikari) connection pool and *then* go through
> pgbouncer) lower their min/max connection settings; it's sort of
> counterintuitive at first glance, but hopefully everyone sees the bigger
> picture.
>
> One nagging question I have is if the slow COMMIT is triggering the
> connection storm (eg app sees slow response or timeout from a current
> connection and fires off a new connection in its place), or vice-versa.
> We're planning to deploy new performant cloud storage (Azure Ultra disk)
> just for WAL logs but I'm hesitant to say it'll be a silver bullet when we
> still have this insane connection management strategy in place.
>
> Curious to know what others think (please pull no punches) and if others
> have been in a similar scenario with anecdotes to share.
>
> Thanks,
> Don.
>
> --
> Don Seiler
> www.seiler.us
>


-- 
Craig



Re: Vacuum Questions

2025-05-05 Thread Craig Jackson
For the two indexes that take 12 hours to vacuum: if you can drop and rebuild
them in less than the 12 hours it takes to vacuum them, and you can have them
be offline for that long, then I would do that. If you can't take the indexes
offline, then consider reindexing online.

Also, if the indexes aren't needed for your delete statements, dropping the
indexes before you start the deletion work would allow you to avoid the
overhead of maintaining the indexes, possibly speeding up the delete
statements.
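
A sketch using the two slow indexes from your log output (the column in the
CREATE INDEX is an assumption based on the index name, so check the actual
definition first, e.g. with \d pc_workflowlog):

-- Online rebuild instead of a long index vacuum (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY workflowlo_n_frozenseti_2kjkbjgf3c6ro;

-- Or drop before the bulk delete and recreate afterwards:
DROP INDEX workflowlo_n_userid_14kqw6qdsnndw;
-- ... run the deletes and the vacuum ...
CREATE INDEX CONCURRENTLY workflowlo_n_userid_14kqw6qdsnndw
    ON pc_workflowlog (userid);  -- column assumed from the index name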

Craig Jackson

On Sat, May 3, 2025 at 1:46 PM  wrote:

> Hi
>
>
>
> It is not your question but for such situations, you should consider using
> partitioning.
>
> And more closely to your question: I would not disable autovacuum, but it
> should not be left running with its default settings.
>
>
>
> Best regards
>
>
>
> *Michel SALAIS*
>
> *From:* Leo 
> *Sent:* Friday, May 2, 2025 16:23
> *To:* [email protected]
> *Subject:* Vacuum Questions
>
>
>
> I have been working on AWS PostgreSQL RDS for a few years, but still not
> very experienced when it comes to performance issues.  Plus RDS is slightly
> different from the pure PostgreSQL.
>
>
>
> I am trying to comprehend exactly how vacuum works.
>
>
>
> Here is what I am trying to do.
>
>
>
> I am purging old records from a table (500 million rows, but I am doing it
> in sets of  50,000,000 with a smaller loop of 100,000).  That works just
> fine.
>
>
>
> Because of the amount of data/rows deleted, I disabled the autovacuum for
> this table (I want to have control over vacuum, autovacuum does not
> complete anyway due to the timeout, sizing, etc settings that I do not want
> to change system wide).  I will put the autovacuum back once I am done of
> course.
>
>
>
> The issue is when I start vacuuming.  This table has 4 indexes and a PK
> that I worry about.  The PK takes about 30 minutes to vacuum and two of the
> indexes take about an hour each.  The problem comes in for the other 2
> indexes - they take 12+ hours each to vacuum:
>
>
>
> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  vacuuming
> "public.pc_workflowlog"
>
> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  launched 4 parallel vacuum
> workers for index vacuuming (planned: 4)
>
> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
> "pc_workflowlog_pk" to remove 5000 row versions
>
> DETAIL:  CPU: user: 191.03 s, system: 12.43 s, elapsed: 1711.22 s
>
> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
> "workflowlo_u_publicid_g6uqp9lkn6e8" to remove 5000 row versions
>
> DETAIL:  CPU: user: 325.75 s, system: 19.75 s, elapsed: 2674.24 s
>
> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
> "workflowlo_n_workflow_2tc9k2hdtry9v" to remove 5000 row versions
>
> DETAIL:  CPU: user: 312.17 s, system: 16.94 s, elapsed: 3097.88 s
>
> *psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
> "workflowlo_n_frozenseti_2kjkbjgf3c6ro" to remove 5000 row versions*
>
> *DETAIL:  CPU: user: 41187.70 s, system: 216.14 s, elapsed: 42749.36 s*
>
> *psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
> "workflowlo_n_userid_14kqw6qdsnndw" to remove 5000 row versions *
>
> *DETAIL:  CPU: user: 41280.66 s, system: 216.67 s, elapsed: 42832.16 s*
>
> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  "pc_workflowlog": removed
> 5000 row versions in 1129870 pages
>
>
>
> I've increased max_parallel_maintenance_workers to 8 for the session and
> it used parallel 4 (one for each index I assume) to handle it and the two
> indexes were done in ~ an hour.  What I am trying to figure out is how to
> force the other two large indexes to be vacuumed in parallel - a few
> workers going against an index.  It seems it is possible to do, the index
> size is large enough to kick in, but I have not been able to figure it out
> yet.  Most of the parameters are at default values.
>
>
>
> What am I missing?
>
>
>
> I have a few other questions.  Does vacuum time depend on the number of
> dead rows only and the size of the table, or does the entire storage
> allocation (including dead tuples) also affect it?
>
>
>
> Would it be more beneficial to drop the two large indexes, purge, vacuum,
> and recreate the indexes afterwards (I know it needs to be tested)?  The
> reason I am doing it in stages is to make sure I have enough time to
> vacuum, but maybe it would not take much longer to vacuum after the
> complete purge?
>
>
>
> Lastly, is it better to delete all the rows (500 mil) instead of doing it
> in smaller batches,

Re: Vacuum Questions

2025-05-05 Thread Craig Jackson
One additional point about reindexing I should have mentioned: make sure
you have adequate disk space to complete the reindexing.
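
A rebuild needs roughly the size of the index being rebuilt in free space
(REINDEX CONCURRENTLY keeps the old and new copies around until the swap), so
a quick sketch to see what you are in for on the table from your log output:

SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'pc_workflowlog'
ORDER BY pg_relation_size(indexrelid) DESC;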

Craig

On Mon, May 5, 2025 at 8:33 AM Craig Jackson 
wrote:

> For the two indexes that take 12 hours to vacuum: If you can drop and
> rebuild them in less than the 12 hours it takes to vacuum them and you can
> have them be offline then I would do that. If you can't take the indexes
> offline then consider reindexing online.
>
> Also, if the indexes aren't needed for your delete statements, dropping
> the indexes before you start the deletion work would allow you to avoid
> the overhead of maintaining the indexes, possibly speeding up the delete
> statements.
>
> Craig Jackson
>
> On Sat, May 3, 2025 at 1:46 PM  wrote:
>
>> Hi
>>
>>
>>
>> It is not your question but for such situations, you should consider
>> using partitioning.
>>
>> And more closely to your question: I would not disable autovacuum, but it
>> should not be left running with its default settings.
>>
>>
>>
>> Best regards
>>
>>
>>
>> *Michel SALAIS*
>>
>> *From:* Leo 
>> *Sent:* Friday, May 2, 2025 16:23
>> *To:* [email protected]
>> *Subject:* Vacuum Questions
>>
>>
>>
>> I have been working on AWS PostgreSQL RDS for a few years, but still not
>> very experienced when it comes to performance issues.  Plus RDS is slightly
>> different from the pure PostgreSQL.
>>
>>
>>
>> I am trying to comprehend exactly how vacuum works.
>>
>>
>>
>> Here is what I am trying to do.
>>
>>
>>
>> I am purging old records from a table (500 million rows, but I am doing
>> it in sets of  50,000,000 with a smaller loop of 100,000).  That works just
>> fine.
>>
>>
>>
>> Because of the amount of data/rows deleted, I disabled the autovacuum for
>> this table (I want to have control over vacuum, autovacuum does not
>> complete anyway due to the timeout, sizing, etc settings that I do not want
>> to change system wide).  I will put the autovacuum back once I am done of
>> course.
>>
>>
>>
>> The issue is when I start vacuuming.  This table has 4 indexes and a PK
>> that I worry about.  The PK takes about 30 minutes to vacuum and two of the
>> indexes take about an hour each.  The problem comes in for the other 2
>> indexes - they take 12+ hours each to vacuum:
>>
>>
>>
>> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  vacuuming
>> "public.pc_workflowlog"
>>
>> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  launched 4 parallel vacuum
>> workers for index vacuuming (planned: 4)
>>
>> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
>> "pc_workflowlog_pk" to remove 5000 row versions
>>
>> DETAIL:  CPU: user: 191.03 s, system: 12.43 s, elapsed: 1711.22 s
>>
>> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
>> "workflowlo_u_publicid_g6uqp9lkn6e8" to remove 5000 row versions
>>
>> DETAIL:  CPU: user: 325.75 s, system: 19.75 s, elapsed: 2674.24 s
>>
>> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
>> "workflowlo_n_workflow_2tc9k2hdtry9v" to remove 5000 row versions
>>
>> DETAIL:  CPU: user: 312.17 s, system: 16.94 s, elapsed: 3097.88 s
>>
>> *psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
>> "workflowlo_n_frozenseti_2kjkbjgf3c6ro" to remove 5000 row versions*
>>
>> *DETAIL:  CPU: user: 41187.70 s, system: 216.14 s, elapsed: 42749.36 s*
>>
>> *psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
>> "workflowlo_n_userid_14kqw6qdsnndw" to remove 5000 row versions *
>>
>> *DETAIL:  CPU: user: 41280.66 s, system: 216.67 s, elapsed: 42832.16 s*
>>
>> psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  "pc_workflowlog": removed
>> 5000 row versions in 1129870 pages
>>
>>
>>
>> I've increased max_parallel_maintenance_workers to 8 for the session and
>> it used parallel 4 (one for each index I assume) to handle it and the two
>> indexes were done in ~ an hour.  What I am trying to figure out is how to
>> force the other two large indexes to be vacuumed in parallel - a few
>> workers going against an index.  It seems it is possible to do, the index
>> size is large enough to kick in, but I have not been able to figure it out
>> yet.  Most of the parameters are at default values.
>>
>>
>>
>> What am I missing?