Unexpected "canceling statement due to user request" error

2019-08-16 Thread Will Storey
Hi!

I have a query that fails due to this error and I'm trying to understand
why.

My understanding is that I should only see this error if a query is cancelled
manually, such as with kill -INT or with pg_cancel_backend(). However, I
can't find anything doing that.
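
For clarity, this is the kind of manual cancellation I mean (an illustrative
sketch only; the WHERE clause is hypothetical and just picks out a backend, and
I can't find anything doing this):

-- cancel another backend's currently running query
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE '%FROM t1%';  -- hypothetical filter, purely to illustrate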

The query looks like this:

SELECT *
FROM (
  SELECT
    c1, c2, [...],
    EXTRACT(EPOCH FROM time) AS epoch,
    to_rfc3339_us(time) AS time_rfc3339
  FROM t1
  WHERE
    NOT EXISTS (SELECT 1 FROM t2 WHERE c2 = ?) AND
    c2 = ? AND
    time < ? AND
    time > ?::timestamptz - ? * interval '1 day'
  LIMIT ?
) AS s
ORDER BY epoch DESC

t1 is partitioned on time, by month.

I run it using Perl's DBI with DBD::Pg. I wrap it in a transaction where I
first run:

SET LOCAL statement_timeout TO 1000
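
So each transaction looks roughly like this (a sketch; the real code goes
through DBI with bind parameters):

BEGIN;
SET LOCAL statement_timeout TO 1000;  -- 1000 ms; SET LOCAL reverts at COMMIT/ROLLBACK
-- the SELECT above, with the ? placeholders bound
COMMIT;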

I know this query can time out, and it does, resulting in the error I
expect: "canceling statement due to statement timeout". The problem is
occasionally I see this other error: "canceling statement due to user
request".

Looking at the query logs, when the query fails with the user request error,
the query had reached the statement timeout (at least in the cases I checked).
In one instance, for example, the logged duration was 1283ms.

Could there be any situation where a query cancelled due to a statement
timeout would be reported as cancelled due to user request? Or do you have
any ideas about what might be going on?

This is on PostgreSQL 9.6.14 on Ubuntu Xenial.

Thank you!




Re: Unexpected "canceling statement due to user request" error

2019-08-17 Thread Will Storey
On Sat 2019-08-17 10:32:40 -0700, Adrian Klaver wrote:
> > I know this query can time out, and it does, resulting in the error I
> > expect: "canceling statement due to statement timeout". The problem is
> > occasionally I see this other error: "canceling statement due to user
> > request".
> > 
> > Looking at the query logs, when the query fails with the user request
> > error, the query had reached the statement timeout (at least in the cases
> > I checked). In one instance, for example, the logged duration was 1283ms.
> > 
> > Could there be any situation where a query cancelled due to a statement
> > timeout would be reported as cancelled due to user request? Or do you
> > have any ideas about what might be going on?
> 
> Here is the relevant source:
> 
> https://doxygen.postgresql.org/postgres_8c.html#ab7af3ee6ad40efb8454c6f2f790d9588
> 
> Start at around line 3078.

Thanks for the pointer!

It looks like we'd have to be entering that section with the statement
timeout indicator not set; otherwise we wouldn't fall through to the "user
request" case.

But as far as I can tell, the only reason my query would be cancelled is the
statement timeout. I wonder if there is some way the statement timeout flag
is being lost or cleared. I see there are situations where we clear timeout
flags.

Something I forgot to mention: The queries are against a hot standby
server.
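
To be precise, these queries run on a replica in recovery, i.e. one where:

SELECT pg_is_in_recovery();  -- returns true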




Re: Unexpected "canceling statement due to user request" error

2019-09-01 Thread Will Storey
On Sun 2019-09-01 19:46:19 -0400, Tom Lane wrote:
> I poked at this for awhile and concluded that what you must be seeing is
> that the statement timeout interrupt triggers, but no CHECK_FOR_INTERRUPTS
> call happens thereafter, until we get to the disable_statement_timeout()
> call in finish_xact_command().  That throws away the timeout-happened
> indicator, but it does nothing to reset InterruptPending or
> QueryCancelPending.  Those flags will also be thrown away eventually,
> after we've read a new client command:
> 
> /*
>  * (4) disable async signal conditions again.
>  *
>  * Query cancel is supposed to be a no-op when there is no query in
>  * progress, so if a query cancel arrived while we were idle, just
>  * reset QueryCancelPending. ProcessInterrupts() has that effect when
>  * it's called when DoingCommandRead is set, so check for interrupts
>  * before resetting DoingCommandRead.
>  */
> CHECK_FOR_INTERRUPTS();
> DoingCommandRead = false;
> 
> Ignoring the timeout if it happened too late is what we're supposed
> to be doing, so that's fine so far.
> 
> But: what happens if some other CHECK_FOR_INTERRUPTS occurs before we
> reach the main input loop?  Of course, exactly the symptom you saw:
> a bogus "canceling statement due to user request" error report.
> What's worse, that happens after we've completed the transaction, leaving
> the client with a very misleading idea of whether the command committed
> or not.
> 
> So how could that happen?  After experimenting, I found a pretty
> obvious candidate: statement duration logging happens after we
> do finish_xact_command, and there's a CHECK_FOR_INTERRUPTS in
> errfinish().  So I hypothesize that you have log_duration or
> log_min_duration_statement enabled.

Yes, I have log_min_duration_statement set to 0! (log_duration is off)
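
For reference, that corresponds to:

SHOW log_min_duration_statement;  -- 0, so every statement gets its duration logged
SHOW log_duration;                -- off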

> I was able to replicate the problem by lobotomizing a couple of
> CHECK_FOR_INTERRUPTS that would normally happen in a simple
> "select pg_sleep()" query (see attached for-demonstration-
> purposes-only patch).  Then I get
> 
> regression=# set statement_timeout TO '1s';
> SET
> regression=# select pg_sleep(2.0);  -- completes normally after 2 sec
>  pg_sleep 
> ----------
>  
> (1 row)
> 
> regression=# set log_min_duration_statement TO '100ms';
> SET
> regression=# select pg_sleep(2.0);  -- sits for 2 sec, then:
> ERROR:  canceling statement due to user request
> 
> So that's not very good and we ought to do something about it.
> We can't have disable_statement_timeout() clear QueryCancelPending,
> because that might lose state about some other cause for query
> cancel.  I'm inclined to think that we need to split QueryCancelPending
> into more than one flag.
> 
> A separate question is how come the particular query you're complaining
> about has (seemingly) a fairly wide window where it never does any
> CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's someplace
> we need to sprinkle a few more of those.

Yeah, it is strange. There are many queries in this system that run with
statement timeouts, but this is the only one where I see this error, and it
happens several times a day.

Thank you so much for looking into this. An impressive analysis!




Re: Unexpected "canceling statement due to user request" error

2019-09-03 Thread Will Storey
On Sun 2019-09-01 20:58:30 -0400, Tom Lane wrote:
> >> A separate question is how come the particular query you're complaining
> >> about has (seemingly) a fairly wide window where it never does any
> >> CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's someplace
> >> we need to sprinkle a few more of those.
> 
> > Yeah, it is strange. There are many queries in this system that run with
> > statement timeouts, but this is the only one where I see this error, and
> > it happens several times a day.
> 
> It'd be interesting to see EXPLAIN ANALYZE output from a typical execution
> of that query.  Even better would be a test case (w/ tables and dummy
> data) so somebody else could reproduce it; but maybe looking at EXPLAIN
> ANALYZE would be enough to determine where we're missing checks.

I put output from a query with values that caused the error here:

https://gist.githubusercontent.com/horgh/653e60b8c2f071d859424517d653eb4e/raw/5d16e11d0ac884ed9dc120c4521e2de8f2e7c3d6/test.txt

I sanitised some of the names and values again. Sorry about that.

As for a test case: I'm not sure how to reproduce it myself. It just happens
periodically (15 times in the past 24 hours). The tables are somewhat large
(60+ GiB per partition). I could try to generate dummy versions if you like,
but the data would have to be heavily sanitised, so I'm not sure it would end
up being representative.




Cancelled query due to buffer deadlock with recovery

2024-12-09 Thread Will Storey
Hello!

I am running Postgres 16.6 on a primary and several standby servers. I
recently observed 3 queries being cancelled with this error:

ERROR: canceling statement due to conflict with recovery (SQLSTATE 40P01)

There was this detail with the error:

User transaction caused buffer deadlock with recovery

The standbys have hot_standby_feedback=on and
max_standby_streaming_delay=1min.

As far as I can tell, there were no long running transactions at the time
on either the primary or the standby in question. I did not see any
replication delay according to monitoring.
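
For what it's worth, the kind of lag check the monitoring relies on is roughly
this, run on the standby (a sketch; the real check is our own tooling):

SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;

It showed essentially no lag around that time.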

The queries that were cancelled were reading catalog tables - pg_inherits
joined with pg_class. I typically expect them to complete quickly (they
have a 100ms statement timeout).
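
Their shape is roughly this (a sketch, not the exact statement; the parent
table name is a placeholder, and the 100ms timeout is shown with a plain SET
just for illustration):

SET statement_timeout TO 100;  -- 100ms

SELECT c.oid::regclass AS partition
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'parent_table'::regclass;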

I understand that canceling due to conflicts should be expected when using
hot standbys, but I'm struggling to understand why it happened in this
case.

What do you think?

Thank you!

Will




Re: Disabling vacuum truncate for autovacuum

2024-12-26 Thread Will Storey
On Thu 2024-12-26 12:21:08 -0800, Jeremy Schneider wrote:
> On Mon, 16 Dec 2024 16:25:06 -0800
> Will Storey  wrote:
> 
> > I would like to disable vacuum's truncate behaviour for autovacuum.
> > Previously I had an outage due to its access exclusive lock when it
> > was replicated to a hot standby.
> > 
> > When that outage happened it was from a VACUUM call in a cronjob
> > rather than autovacuum. I now run such VACUUMs with TRUNCATE false
> > which avoids the issue for these. However I've realized that
> > autovacuum could cause this as well. This is of interest to me
> > because I'm looking at tuning autovacuum and getting rid of the
> > cronjob, but I've realized relying on autovacuum could be dangerous
> > because of the truncates.
> 
> Can you tell us a little bit more about the outage? Autovacuum is
> designed to quickly relinquish this lock if there is any contention, and
> the dangers of disabling autovacuum are significant, so your statement
> about autovac being "dangerous" will raise a lot of eyebrows.
>
> Did your outage involve hot standbys serving read-only traffic, or did
> it only involve a read-write database?
>
> What was the exact nature of the outage and how did you narrow down the
> cause to the exclusive lock held specifically during an autovacuum
> truncation?

My incident was actually not caused by autovacuum. A VACUUM was run against
the primary by a cronjob. A web service running read queries against hot
standbys went down for several minutes as its queries were stuck in a lock
queue.
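
From the standby's point of view the stuck queries looked like ordinary lock
waits, roughly the sort of thing this shows (a sketch of how I would look for
it today, not a capture from the incident):

SELECT pid, wait_event_type, wait_event, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';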

While this was VACUUM, my understanding is that autovacuum could do this as
well because it does not see the queries on the hot standby that could be
blocked by it, so it won't know to stop its work. I think this issue is
part of what lead to the addition of the vacuum_truncate reloption
discussed in
https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com,
e.g. this message:
https://www.postgresql.org/message-id/20190408044345.ndxsnveqqlj3m67g%40alap3.anarazel.de.
I could be misunderstanding it though!

As I recall, I confirmed the cause via the query logs. I noticed the table was
vacuumed at the time, which led me to learn about the page truncation
behaviour. It has been a couple of years though.

The cronjob still runs every night, but now with TRUNCATE false. I've been
thinking of getting rid of it and relying more on autovacuum, which is why
I've been revisiting this. In addition, we're no longer protected by
old_snapshot_threshold disabling the page truncation globally, since that
option has been removed.
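
Concretely, the nightly job now does something like this per table (a sketch;
the table name is a placeholder):

VACUUM (TRUNCATE false) some_table;  -- skip the trailing-page truncation step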




Disabling vacuum truncate for autovacuum

2024-12-16 Thread Will Storey
Hi!

I would like to disable vacuum's truncate behaviour for autovacuum.
Previously I had an outage due to its access exclusive lock when it was
replicated to a hot standby.

When that outage happened it was from a VACUUM call in a cronjob rather
than autovacuum. I now run such VACUUMs with TRUNCATE false which avoids
the issue for these. However I've realized that autovacuum could cause this
as well. This is of interest to me because I'm looking at tuning autovacuum
and getting rid of the cronjob, but I've realized relying on autovacuum
could be dangerous because of the truncates.

I believe the only way to disable this for autovacuum is by changing the
vacuum_truncate storage parameter on tables (ignoring the now-removed
old_snapshot_threshold option). I'm thinking of altering all my tables to
turn it off. Is this a horrible idea? I expect I would need to monitor tables
for problematic growth, but that might be better than a surprise outage. I
suppose the growth could cause an outage too, but I'm thinking it would be
more controllable.
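
What I have in mind is roughly this (a sketch; the table name is a placeholder,
and the DO block is just one illustrative way to cover ordinary tables in the
public schema):

ALTER TABLE some_table SET (vacuum_truncate = off);

DO $$
DECLARE
  r record;
BEGIN
  FOR r IN
    SELECT c.oid::regclass AS tbl
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r' AND n.nspname = 'public'
  LOOP
    EXECUTE format('ALTER TABLE %s SET (vacuum_truncate = off)', r.tbl);
  END LOOP;
END
$$;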

Would I need to disable the settings on catalog tables too? (To rule out
any possibility of it happening). Are there any other things I might be
missing?

I am also wondering whether an autovacuum setting to control this would be a
good feature. That would be simpler for me than altering all my tables, and
it would help me avoid missing any (e.g. catalogs, new tables).

I might be worrying needlessly about this as maybe it is unlikely to
happen. I suppose it is workload dependent.

Thank you!

Will




Re: Disabling vacuum truncate for autovacuum

2024-12-17 Thread Will Storey
On Tue 2024-12-17 08:30:19 +0100, Laurenz Albe wrote:
> > Would I need to disable the settings on catalog tables too? (To rule out
> > any possibility of it happening). Are there any other things I might be
> > missing?
> 
> Potentially yes.  But unless you are using temporary tables or create,
> alter and drop lots of objects, that shouldn't be necessary.

I see. That makes sense. I do have some things that use temporary tables as
well as some jobs that create/drop objects. They are not very frequent nor
are there a huge number of objects involved, but I suppose it could still be
an issue. I'm not keen on altering the catalogs, but it sounds like if I
want to be very safe then I would need to.

> > I am also wondering if having an autovacuum setting to control it would be
> > a good idea for a feature.
> 
> I'm all for that.

I previously had old_snapshot_threshold enabled, which I believe would have
done this anyway, including for the catalog tables. That was convenient!

> Yours,
> Laurenz Albe

Thank you Laurenz! I've read a bunch of your writing and I've learned a lot
from you. I'm a big fan :-). Thank you for what you do!