Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Jim Jarvie

Updates added; a mixture of good and bad news:

On 18-Aug.-2020 20:39, Michael Lewis wrote:

How long does each process take in total? How strict does that FIFO really
need to be when you are already doing SKIP LOCKED anyway?


The processes all bottleneck[ed] for several minutes, and the delay grew 
roughly exponentially with the number of processes above the threshold where 
the problem appeared.  Up to around 60 concurrent worked with minimal delay, 
but beyond that a few more added about a minute, 70 took about 5 minutes, 80 
about 30 minutes, and beyond that it was hours (I left it up to 12 hours one time).


However, I removed the ORDER BY clause, which eliminated [only] the high 
CPU.  The processes all stalled in the same pattern, just without the 
high CPU use.  So the ordering was the cause of the CPU use, but was 
not responsible for the - forgive the pun - lock-up.


I then added a random few seconds of delay to each process before it 
executes the select, to prevent too many of them colliding on 
simultaneous selects.  That was enough to make the lock-up disappear and 
individual selects complete in a few ms, regardless of how many other 
concurrent transactions are in progress (tested up to 192 concurrent).  
But still not quite out of the woods - read below.
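
For illustration, a minimal sketch of the jitter idea (the sleep could just 
as well live in the application; the 5-second ceiling is made up, and the 
claim query is the one shown later in the thread):

SELECT pg_sleep(random() * 5);   -- stagger this worker by up to ~5 seconds

SELECT queueid, txobject, objectid, state
  FROM mq.queue
 WHERE (state = 'tx_active' OR state = 'tx_fail_retryable')
   AND txobject = 'ticket'
 LIMIT 250
   FOR UPDATE SKIP LOCKED;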



Can you expound on the partitioning? Are all consumers of the queue always
hitting one active partition, and anytime a row is processed, it always
moves to one of many archived-type partitions?


Partitions are list partitioned as 'incoming', 'processing', 'retry', 
'ok', 'failed' (a rough DDL sketch follows the breakdown below):


Incoming: This is itself hash partitioned (64 partitions); approx 10M 
rows are added per day, so it is partitioned to handle the incoming 
throughput.  This works well.


Processing: Simple partition; data is moved into this in blocks from 
incoming - as the row count drops below a threshold, another block is 
added.


Retry: simple partition; non-fatal errors go in here and go back into 
the processing queue for retries later.


Failed: simple partition, fatal errors go here.  Thankfully very few.

OK: hash partition, as everything that was in incoming should eventually 
end up here.  64 partitions currently.
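
As a rough, hypothetical DDL sketch of the layout above (the mq.queue name, 
the queueid/txobject/objectid/state columns and the queue_tx_active / 
queue_tx_fail_retryable partitions appear later in this thread; the remaining 
state values, types and partition names are guesses for illustration only):

CREATE TABLE mq.queue (
    queueid   bigint           NOT NULL,
    txobject  mq.queue_object  NOT NULL,
    objectid  bigint,
    state     mq.tx_state      NOT NULL
) PARTITION BY LIST (state);

-- incoming: hash sub-partitioned 64 ways to absorb ~10M inserts/day
CREATE TABLE mq.queue_incoming PARTITION OF mq.queue
    FOR VALUES IN ('incoming') PARTITION BY HASH (queueid);
-- CREATE TABLE mq.queue_incoming_0 PARTITION OF mq.queue_incoming
--     FOR VALUES WITH (MODULUS 64, REMAINDER 0);   -- ... and so on, 64 times

-- processing and retry: plain partitions, small because they only hold in-flight rows
CREATE TABLE mq.queue_tx_active PARTITION OF mq.queue
    FOR VALUES IN ('tx_active');
CREATE TABLE mq.queue_tx_fail_retryable PARTITION OF mq.queue
    FOR VALUES IN ('tx_fail_retryable');

-- failed: plain partition, thankfully few rows
CREATE TABLE mq.queue_tx_fail PARTITION OF mq.queue
    FOR VALUES IN ('tx_fail');

-- ok: hash sub-partitioned 64 ways, since everything eventually lands here
CREATE TABLE mq.queue_ok PARTITION OF mq.queue
    FOR VALUES IN ('ok') PARTITION BY HASH (queueid);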


There is one interesting thing about this.  When the SELECT FOR UPDATE 
SKIP LOCKED is executed, the select reasonably frequently aborts with 
the error:


Tuple to be locked was already moved to another partition due to 
concurrent update.


This error still persists even though the lock-up has been removed by 
the time delay, so there is a regular stream of transactions aborting 
due to this (I just re-run the transaction to recover).


Now, if locking worked as I understand it, when another process locked a 
row and migrated it, that should still have left the lock in place on the 
original partition and created a new one on the row inserted into the 
destination partition until a commit was done.  The second process should 
not have visibility of the newly inserted row, and SKIP LOCKED should 
simply have skipped over the locked-but-deleted row in the original 
partition.


What am I missing?  All of this feels like some locking/partitioning 
issue but I'm unsure exactly what.



Is that done via FDW or otherwise within the same database transaction? Are
you connecting some queue consumer application code to Postgres, select for
update, doing work on some remote system that is slow, and then coming back
and committing the DB work?

Alas, not FDW - an actual external system elsewhere in the world which 
sends an ACK when it has processed the message.  I have no control or 
influence over it.

By the way, top-posting is discouraged here and partial quotes with
interspersed comments are common practice.

Noted!


Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Michael Lewis
Great to hear that some of the issues are now mitigated. Though, perhaps
you actually require that ORDER BY if items are expected to be sitting in
the queue quite some time because you have incoming queue items in a burst
pattern and have to play catch up sometimes. If so, I highly suspect the
index on q_id is becoming very bloated and reindex concurrently would help.
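
For example, something along these lines (the index name here is hypothetical,
REINDEX ... CONCURRENTLY needs PostgreSQL 12+, and on a partitioned parent you
would reindex the leaf indexes individually):

REINDEX INDEX CONCURRENTLY mq.queue_tx_active_q_id_idx;  -- hypothetical leaf index name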

> Partitions are list partitioned as 'incoming', 'processing', 'retry',
> 'ok', 'failed':

I am unclear on what purpose a "processing" status would have. Shouldn't a
row be in the incoming status & locked by select for update, until it
either gets updated to ok or failed (or left alone if a retry is needed)?
What purpose do the retry and processing statuses serve? I don't understand
your full workflow well enough to venture a guess on how you are hitting
that error about a row being in the wrong partition, but fewer main-level
partitions and removing unneeded updates seem likely to help, or perhaps
resolve the issue.

I don't know if you might have missed my last message, and the suggestion
from Laurenz to check pgstattuple.

At a high level, it seems like any needed update to the rows would result
in the row being removed from the current partition and moved to another
partition. If you are doing this in a transaction block, then you could
just as well skip the select for update and just DELETE ... RETURNING from
the existing partition and insert into the new partition later (use a
select for update if you want to order the deletes*). If your transaction
fails and gets rolled back, then the delete won't have happened and the row
will get picked up by the next consumer.
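
A minimal sketch of that approach, using the table and column names that show
up later in the thread and the 250-row batch size (the names are the thread's,
but the exact query is only illustrative):

BEGIN;

DELETE FROM mq.queue_tx_active q
 WHERE q.queueid IN (SELECT queueid
                       FROM mq.queue_tx_active
                      WHERE txobject = 'ticket'
                      LIMIT 250
                        FOR UPDATE SKIP LOCKED)
RETURNING q.queueid, q.txobject, q.objectid;

-- ... do the work, insert the finished rows into the ok/retry/failed partition ...

COMMIT;

An ORDER BY queueid in the subquery would keep the rough FIFO behaviour
discussed earlier in the thread.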

Another thought: I don't know how performant that hash partitioning will
be for select for update, particularly if it potentially targets many
partitions. Would it be feasible to match the number of partitions to the
number of consumers and actually have each of them working on one?


*
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/


Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Jim Jarvie


On 20-Aug.-2020 13:30, Michael Lewis wrote:

Great to hear that some of the issues are now mitigated. Though, perhaps
you actually require that ORDER BY if items are expected to be sitting in
the queue quite some time because you have incoming queue items in a burst
pattern and have to play catch up sometimes. If so, I highly suspect the
index on q_id is becoming very bloated and reindex concurrently would help.

I managed to bypass the need for the sort by relying on the active feed 
only sending the oldest items in for processing (it was always doing 
that).  Some of the earlier e-mails in this thread prompted the 
realization that my ORDER BY when processing was really pretty pointless: 
I need more-or-less ordered rather than strictly ordered, and that was 
already happening due to how the process list was being fed.


I don't know if you might have missed my last message, and the suggestion
from Laurenz to check pgstattuple.

I still need to look at that, but since I had made some progress, I got 
pretty excited and have not got round to it yet.

*
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/


This does warn about the overhead, but I've also upgraded pg_top on my 
system today and saw a useful additional data point that it displays - 
the number of locks held by a process.


What I see happening is that when the select statements collide, they 
are holding about 10-12 locks each and then begin to very slowly acquire 
more locks every few seconds.  One process will acquire locks faster than 
the others, reach the target (250) and start processing.  Then another 
takes the lead, and so on, until a critical mass is reached and the 
remaining processes all acquire their locks within a few seconds.


I still keep thinking there is some scaling-type issue in the locking 
here, possibly due to this being a partitioned table (given that 
tuple-moved error).




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Michael Lewis
Can you share an explain analyze for the query that does the select for
update? I wouldn't assume that partition pruning is possible at all with
hash, and it would be interesting to see how it is finding those rows.



Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Jim Jarvie


On 20-Aug.-2020 17:42, Michael Lewis wrote:

Can you share an explain analyze for the query that does the select for
update? I wouldn't assume that partition pruning is possible at all with
hash, and it would be interesting to see how it is finding those rows.


Well, this got interesting - the already-moved error showed up.  Note: 
the actual process partitions are regular table partitions; these are 
not hashed.  Only the incoming and completed partitions are hashed, due 
to the row counts at either end of the processing; in-flight (where the 
issue shows up) is quite small:


[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue
             where (state = 'tx_active' or state='tx_fail_retryable') and
             txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 579.131 ms

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue
             where (state = 'tx_active' or state='tx_fail_retryable') and
             txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 568.008 ms

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue
             where (state = 'tx_active' or state='tx_fail_retryable') and
             txobject = 'ticket' limit 250 for update skip locked;
                                 QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..25.71 rows=250 width=34) (actual time=1306.041..1306.338 rows=250 loops=1)
   ->  LockRows  (cost=0.00..7934.38 rows=77150 width=34) (actual time=1306.040..1306.315 rows=250 loops=1)
         ->  Append  (cost=0.00..7162.88 rows=77150 width=34) (actual time=520.685..1148.347 rows=31500 loops=1)
               ->  Seq Scan on queue_tx_active  (cost=0.00..6764.50 rows=77000 width=34) (actual time=520.683..1145.258 rows=31500 loops=1)
                     Filter: ((txobject = 'ticket'::mq.queue_object) AND ((state = 'tx_active'::mq.tx_state) OR (state = 'tx_fail_retryable'::mq.tx_state)))
               ->  Seq Scan on queue_tx_fail_retryable  (cost=0.00..12.62 rows=150 width=34) (never executed)
                     Filter: ((txobject = 'ticket'::mq.queue_object) AND ((state = 'tx_active'::mq.tx_state) OR (state = 'tx_fail_retryable'::mq.tx_state)))
 Planning Time: 0.274 ms
 Execution Time: 1306.380 ms
(9 rows)

Time: 1317.150 ms (00:01.317)
[queuedb] #



Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Michael Lewis
On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie  wrote:

> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>
> Can you share an explain analyze for the query that does the select for
> update? I wouldn't assume that partition pruning is possible at all with
> hash, and it would be interesting to see how it is finding those rows.
>
> Well this got interesting  - the already moved error showed up:  Note, the
> actual process partitions are regular table partitions, these are not
> hashed.  Only the incoming and completed are hashed due to row counts at
> either end of the processing; in flight (where the issue shows up) is quite
> small:
>
> [queuedb] # explain analyze select queueid,txobject,objectid,state from
> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and
> txobject = 'ticket' limit 250 for update skip locked;
> ERROR:  40001: tuple to be locked was already moved to another partition
> due to concurrent update
> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
> Time: 579.131 ms
>
That is super curious. I hope that someone will jump in with an explanation
or theory on this.

I still wonder why the move between partitions is needed though if the work
is either done (failed or successful) or not done... not started, retry
needed or in progress... it doesn't matter. It needs to get picked up by
the next process if it isn't already row locked.



Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread David Rowley
On Fri, 21 Aug 2020 at 11:01, Michael Lewis  wrote:
>
> On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie  wrote:
>>
>> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>>
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>>
>> Well this got interesting  - the already moved error showed up:  Note, the 
>> actual process partitions are regular table partitions, these are not 
>> hashed.  Only the incoming and completed are hashed due to row counts at 
>> either end of the processing; in flight (where the issue shows up) is quite 
>> small:
>>
>> [queuedb] # explain analyze select queueid,txobject,objectid,state from 
>> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and 
>> txobject = 'ticket' limit 250 for update skip locked;
>> ERROR:  40001: tuple to be locked was already moved to another partition due 
>> to concurrent update
>> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
>> Time: 579.131 ms
>
> That is super curious. I hope that someone will jump in with an explanation 
> or theory on this.
>
> I still wonder why the move between partitions is needed though if the work 
> is either done (failed or successful) or not done... not started, retry 
> needed or in progress... it doesn't matter. It needs to get picked up by the 
> next process if it isn't already row locked.

I may be heading off in the wrong direction as I'm not fully sure I
understand what the complaint is about, but isn't the executor just
hitting dead rows in one of the active or failed partitions that have
been moved off to some other partition?

When updates occur in a non-partitioned table we can follow item
pointer chains to find the live row and check if the WHERE clause
still matches to determine if the row should be updated, or in this
case just locked since it's a SELECT FOR UPDATE. However, with a
partitioned table, a concurrent UPDATE may have caused the row to be
moved off to another partition, in which case the tuple's item
pointer cannot point to it, since we don't have enough address space -
we only have 6 bytes for a TID. To get around the fact that we can't
follow these update chains, we just throw the serialization error,
which is what you're getting. Ideally, we'd figure out where the live
version of the tuple is and check if it matches the WHERE clause and
lock it if it does, but we've no means to do that with the current
design.

If the complaint is about the fact that you're getting the error and
you think you shouldn't be, because you said "SKIP LOCKED", then I'm not
really sure the fact that you said "SKIP LOCKED" gives us the right to
ignore this case. The user only gave us the go-ahead to skip locked
tuples, not to skip tuples that we just failed to follow item pointer
chains for.  It might be okay to do this for rows that have since
been marked as complete, as they no longer match your WHERE clause;
however, if the row has gone from the queue_tx_active partition into
the queue_tx_fail_retryable partition then I don't see why we'd have
the right to skip the tuple. Your query says you want tuples that need
to be retried. We can't go skipping them.

So isn't the fix just to code the application to retry on 40001 errors?
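
As an illustration only - the consumers' language isn't shown in this thread,
and an application would normally just retry the whole transaction - the same
idea can be sketched in plpgsql by trapping SQLSTATE 40001 around the claim
query (each attempt here runs in its own subtransaction):

DO $$
DECLARE
    attempts int := 0;
BEGIN
    LOOP
        BEGIN
            -- the thread's claim query; PERFORM discards the rows, a real
            -- consumer would fetch and use them
            PERFORM queueid
               FROM mq.queue
              WHERE (state = 'tx_active' OR state = 'tx_fail_retryable')
                AND txobject = 'ticket'
              LIMIT 250
                FOR UPDATE SKIP LOCKED;
            EXIT;                                   -- claimed a batch, done
        EXCEPTION WHEN serialization_failure THEN   -- SQLSTATE 40001
            attempts := attempts + 1;
            IF attempts >= 5 THEN
                RAISE;                              -- give up after a few tries
            END IF;
        END;
    END LOOP;
END
$$;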

David




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread Thomas Munro
On Fri, Aug 21, 2020 at 9:58 AM Jim Jarvie  wrote:
> However, as I scale up the number of concurrent connections, I see a spike in 
> CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes 
> and the select processes wait for multiple minutes (10-20 minutes) before 
> completing.  My use case requires around 256 concurrent processors for the 
> queue but I've been unable to scale beyond 128 without everything grinding to 
> a halt.

Maybe it's just getting algorithmically ugly.  To claim some job rows,
you have to skip all dead/non-matching tuples left behind so far at
the start of the table by all the other sessions, and then also all
currently locked tuples, and you have to do update-chain walks on some
of them too.  It all gets a bit explosive once you have such high
numbers of workers.

I think I'd experiment with splitting the job table up into N tables
and feed jobs into all of them about evenly (by hashing, at random,
whatever), and then I'd assign each consumer a "preferred" table where
it looks for jobs first (perhaps my_worker_id % ntables), before
trying the others in round robin order.  Then they won't trample on
each other's toes so much.
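
A sketch of what a claim might look like under that scheme (all of the
mq.jobs_N names are invented for illustration; the thread's real table is
just mq.queue):

-- jobs are spread across mq.jobs_0 .. mq.jobs_7 at insert time; worker k tries
-- its preferred table (k % 8) first - here worker 3 starts with mq.jobs_3
SELECT queueid, txobject, objectid, state
  FROM mq.jobs_3
 WHERE txobject = 'ticket'
 LIMIT 250
   FOR UPDATE SKIP LOCKED;
-- if that comes back short, repeat against mq.jobs_4, mq.jobs_5, ... in round-robin order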

In the past I've wondered about a hypothetical circular_seqscan
option, which would cause table scans to start where they left off
last time in each backend, so SELECT * FROM t LIMIT 1 repeated would
show you a different row each time until you get all the way around to
the start again (as we're entirely within our rights to do for a query
with no ORDER BY).  That'd give the system a chance to vacuum and
start refilling the start of the table before you get around to it
again, instead of repeatedly having to step over the same useless
pages every time you need a new job.  Combined with the N tables
thing, you'd be approaching a sweet spot for contention and dead tuple
avoidance.  The synchronized_seqscans setting is related to this idea,
but more expensive, different, and probably not useful.

Hmm.  I guess another way to avoid colliding with others' work would
be to try to use SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR
UPDATE SKIP LOCKED LIMIT ...  It's less cache-friendly, and less
order-preserving, but way more contention-friendly.  That has another
complication though: how do you pick 10?  And if it doesn't return any
rows, or not enough rows, that doesn't mean there aren't enough, so you
may need to be ready to fall back to the plain approach if having 250
rows is really important to you and TABLESAMPLE doesn't give you
enough.  Or something.
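
Filled in with the thread's table and batch size, and aimed straight at the
in-flight partition (the 10 percent figure is just the placeholder from
above), that might look something like:

SELECT queueid, txobject, objectid, state
  FROM mq.queue_tx_active TABLESAMPLE SYSTEM (10)
 WHERE txobject = 'ticket'
 LIMIT 250
   FOR UPDATE SKIP LOCKED;
-- if it returns fewer than 250 rows, fall back to the plain, unsampled query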

By the way, when working with around 64 consumer processes I was also
annoyed by the thundering herd problem when using NOTIFY.  I found
various workaround solutions to that, but ultimately I think we need
more precise wakeups for that sort of thing, which I hope to revisit
one day.