Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Updates added, a mixture of good and bad news:

On 18-Aug.-2020 20:39, Michael Lewis wrote:
> How long does each process take in total? How strict does that FIFO really
> need to be when you are already doing SKIP LOCKED anyway?

The processes all bottlenecked for several minutes, roughly exponential in the number of processes above the threshold where the problem appeared. Up to around 60 concurrent worked with minimal delay, but beyond that a few more added about a minute, 70 about 5 minutes, 80 about 30 minutes, and beyond that it was hours (I left it up to 12 hours one time).

However, removing the ORDER BY clause eliminated only the high CPU: the processes all stalled in the same pattern, just without the high CPU use. So the ordering was the factor in the CPU use, but was not responsible for the - forgive the pun - lock-up.

I then added a random few seconds of delay to each process before it executes the select, to prevent too many of them colliding on simultaneous selects. That was enough to make the lock-up disappear, and individual selects now complete in a few ms regardless of how many other concurrent transactions are in progress (tested up to 192 concurrent). But we are still not quite out of the woods - read below.

> Can you expound on the partitioning? Are all consumers of the queue always
> hitting one active partition and anytime a row is processed, it always moves
> to one of many archived-type partitions?

Partitions are list partitioned as 'incoming', 'processing', 'retry', 'ok', 'failed' (a rough sketch of the layout is at the end of this message):

Incoming: itself hash partitioned (64 partitions); approx 10M rows added/day, so partitioned to allow incoming throughput. This works well.

Processing: simple partition; data is moved into this in blocks from incoming - as the row count drops below a threshold, another block is added.

Retry: simple partition; non-fatal errors go in here and go back into the processing queue for retries later.

Failed: simple partition; fatal errors go here. Thankfully very few.

OK: hash partition (64 partitions currently), as everything that was in incoming should eventually end up here.

There is one interesting thing about this. When the SELECT FOR UPDATE SKIP LOCKED is executed, reasonably frequently the select aborts with the error:

  Tuple to be locked was already moved to another partition due to concurrent update

This error persists even though the lock-up has been removed by the time delay, so there is a regular stream of transactions aborting because of it (I just re-run the transaction to recover). Now, if locking worked as I understand it, when another process locked and migrated a row, this should still have left the lock in place on the original partition and created a new one on the destination partition until a commit was done. The second process should not have visibility of the newly inserted row, and SKIP LOCKED should simply have skipped over the locked-but-deleted row on the original partition. What am I missing? All of this feels like some locking/partitioning issue, but I'm unsure exactly what.

> Is that done via FDW or otherwise within the same database transaction? Are
> you connecting some queue consumer application code to Postgres, select for
> update, doing work on some remote system that is slow, and then coming back
> and committing the DB work?

Alas not FDW: an actual external system elsewhere in the world which sends an ACK when it has processed the message. I have no control or influence on this.
> By the way, top-posting is discouraged here and partial quotes with
> interspersed comments are common practice.

Noted!
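For reference, a rough sketch of the layout described above. This is an illustrative reconstruction only: table, column, and partition names and the plain text typing of state/txobject are assumptions (later messages in the thread show the real table is mq.queue with enum-typed columns), so treat it as the shape, not the actual schema.

CREATE SCHEMA IF NOT EXISTS mq;

-- Parent queue table, list partitioned on the processing state.
CREATE TABLE mq.queue (
    queueid   bigint NOT NULL,
    txobject  text   NOT NULL,
    objectid  bigint NOT NULL,
    state     text   NOT NULL
) PARTITION BY LIST (state);

-- The high-volume ends of the pipeline are hash sub-partitioned 64 ways.
CREATE TABLE mq.queue_incoming PARTITION OF mq.queue
    FOR VALUES IN ('incoming') PARTITION BY HASH (queueid);
CREATE TABLE mq.queue_ok PARTITION OF mq.queue
    FOR VALUES IN ('ok') PARTITION BY HASH (queueid);

-- The in-flight states are plain partitions.
CREATE TABLE mq.queue_processing PARTITION OF mq.queue FOR VALUES IN ('processing');
CREATE TABLE mq.queue_retry      PARTITION OF mq.queue FOR VALUES IN ('retry');
CREATE TABLE mq.queue_failed     PARTITION OF mq.queue FOR VALUES IN ('failed');

-- One of the 64 hash sub-partitions of incoming; repeat for remainders 1-63,
-- and likewise for mq.queue_ok.
CREATE TABLE mq.queue_incoming_00 PARTITION OF mq.queue_incoming
    FOR VALUES WITH (MODULUS 64, REMAINDER 0);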
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Great to hear that some of the issues are now mitigated. Though perhaps you actually require that ORDER BY if items are expected to be sitting in the queue quite some time, because you have incoming queue items in a burst pattern and have to play catch-up sometimes. If so, I highly suspect the index on q_id is becoming very bloated and reindex concurrently would help.

> Partitions are list partitioned as 'incoming', 'processing', 'retry', 'ok', 'failed':

I am unclear on what purpose a "processing" status would have. Shouldn't a row be in the incoming status and locked by select for update until it either gets updated to ok or failed (or left alone if retry is needed)? What purpose do the retry and processing statuses serve? I don't understand your full workflow well enough to venture a guess on how you are hitting that error regarding a row being in the wrong partition, but fewer main-level partitions and removing unneeded updates seem likely to help or perhaps resolve the issue.

I don't know if you might have missed my last message, and the suggestion from Laurenz to check pgstattuple.

At a high level, it seems like any needed update to a row results in it being removed from its current partition and moved to another partition. If you are doing this in a transaction block, then you could just as well skip the select for update and just DELETE [] RETURNING from the existing partition and insert into the new partition later (use a select for update if you want to order the deletes*). If your transaction fails and gets rolled back, then the delete won't have happened and the row will get picked up by the next consumer.

Another thought: I don't know how performant that hash partitioning will be for select for update, particularly if it potentially targets many partitions. Would it be feasible to match the number of partitions to the number of consumers and actually have each of them working on one?

* https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
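For what it's worth, a minimal sketch of that DELETE ... RETURNING consumption pattern against the queue table from this thread. The exact state values and how the outcome rows get re-inserted are assumptions; the point is only the claim-by-delete shape.

BEGIN;

-- Claim a batch by deleting it. While this transaction is open the deleted
-- rows are invisible to other consumers, and if we roll back they simply
-- reappear. The inner SELECT keeps the SKIP LOCKED behaviour and is where an
-- ORDER BY would go if ordering of the deletes matters.
DELETE FROM mq.queue q
WHERE q.queueid IN (
        SELECT queueid
        FROM mq.queue
        WHERE (state = 'tx_active' OR state = 'tx_fail_retryable')
          AND txobject = 'ticket'
        LIMIT 250
        FOR UPDATE SKIP LOCKED
      )
RETURNING q.queueid, q.txobject, q.objectid;

-- ... do the external work against the returned rows ...

-- Then record the outcome before committing, e.g. by inserting the rows back
-- into mq.queue with state = 'ok' or 'failed' so they land in a terminal
-- partition.

COMMIT;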
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On 20-Aug.-2020 13:30, Michael Lewis wrote:
> Great to hear that some of the issues are now mitigated. Though perhaps you
> actually require that ORDER BY if items are expected to be sitting in the
> queue quite some time, because you have incoming queue items in a burst
> pattern and have to play catch-up sometimes. If so, I highly suspect the
> index on q_id is becoming very bloated and reindex concurrently would help.

I managed to bypass the need for the sort by relying on the active feed only sending the oldest items in for processing (it was always doing that anyway). Based on some of the earlier e-mails in this thread, it prompted the revelation that my ORDER BY when processing was really pretty pointless, because I need more-or-less ordered rather than strictly ordered, and that was already happening due to how the process list was being fed.

> I don't know if you might have missed my last message, and the suggestion
> from Laurenz to check pgstattuple.

I still need to look at that, but since I had made some progress, I got pretty excited and have not got round to it yet.

> * https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/

This does warn about the overhead, but I've also upgraded pg_top on my system today and saw a useful additional data point that it displays: the number of locks held by a process. What I see happening is that when the select statements collide, they hold about 10-12 locks each and then begin to very slowly acquire more locks every few seconds. One process will grow quicker than the others, reach the target (250) and start processing. Then another takes the lead, and so on until a critical mass is reached, after which the remaining processes all acquire their locks within a few seconds. I still keep thinking there is some scaling issue here in the locking, possibly due to it being a partitioned table (given that "tuple moved" error).
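As an aside, the same per-process lock counts can be watched from SQL as well as from pg_top; this is just a query over the standard pg_locks and pg_stat_activity views, and the commented pgstattuple call is the bloat check mentioned above (the partition name in it is an assumption):

-- Locks currently granted per backend, heaviest lock holders first.
SELECT a.pid,
       count(*)          AS locks_held,
       a.state,
       left(a.query, 60) AS current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.granted
GROUP BY a.pid, a.state, a.query
ORDER BY locks_held DESC;

-- The bloat check suggested earlier (requires the pgstattuple extension):
-- CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- SELECT * FROM pgstattuple('mq.queue_tx_active');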
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Can you share an explain analyze for the query that does the select for update? I wouldn't assume that partition pruning is possible at all with hash, and it would be interesting to see how it is finding those rows.
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On 20-Aug.-2020 17:42, Michael Lewis wrote:
> Can you share an explain analyze for the query that does the select for
> update? I wouldn't assume that partition pruning is possible at all with
> hash, and it would be interesting to see how it is finding those rows.

Well this got interesting - the "already moved" error showed up. Note, the actual process partitions are regular table partitions; these are not hashed. Only the incoming and completed partitions are hashed, due to row counts at either end of the processing; in-flight (where the issue shows up) is quite small:

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 579.131 ms

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 568.008 ms

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
                                    QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..25.71 rows=250 width=34) (actual time=1306.041..1306.338 rows=250 loops=1)
   ->  LockRows  (cost=0.00..7934.38 rows=77150 width=34) (actual time=1306.040..1306.315 rows=250 loops=1)
         ->  Append  (cost=0.00..7162.88 rows=77150 width=34) (actual time=520.685..1148.347 rows=31500 loops=1)
               ->  Seq Scan on queue_tx_active  (cost=0.00..6764.50 rows=77000 width=34) (actual time=520.683..1145.258 rows=31500 loops=1)
                     Filter: ((txobject = 'ticket'::mq.queue_object) AND ((state = 'tx_active'::mq.tx_state) OR (state = 'tx_fail_retryable'::mq.tx_state)))
               ->  Seq Scan on queue_tx_fail_retryable  (cost=0.00..12.62 rows=150 width=34) (never executed)
                     Filter: ((txobject = 'ticket'::mq.queue_object) AND ((state = 'tx_active'::mq.tx_state) OR (state = 'tx_fail_retryable'::mq.tx_state)))
 Planning Time: 0.274 ms
 Execution Time: 1306.380 ms
(9 rows)

Time: 1317.150 ms (00:01.317)
[queuedb] #
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie wrote:
> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>
> Can you share an explain analyze for the query that does the select for
> update? I wouldn't assume that partition pruning is possible at all with
> hash, and it would be interesting to see how it is finding those rows.
>
> Well this got interesting - the already moved error showed up: Note, the
> actual process partitions are regular table partitions, these are not
> hashed. Only the incoming and completed are hashed due to row counts at
> either end of the processing; in flight (where the issue shows up) is quite
> small:
>
> [queuedb] # explain analyze select queueid,txobject,objectid,state from
> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and
> txobject = 'ticket' limit 250 for update skip locked;
> ERROR: 40001: tuple to be locked was already moved to another partition
> due to concurrent update
> LOCATION: heapam_tuple_lock, heapam_handler.c:405
> Time: 579.131 ms

That is super curious. I hope that someone will jump in with an explanation or theory on this.

I still wonder why the move between partitions is needed though, if the work is either done (failed or successful) or not done... not started, retry needed or in progress... it doesn't matter. It needs to get picked up by the next process if it isn't already row locked.
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On Fri, 21 Aug 2020 at 11:01, Michael Lewis wrote:
>
> On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie wrote:
>>
>> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>>
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>>
>> Well this got interesting - the already moved error showed up: Note, the
>> actual process partitions are regular table partitions, these are not
>> hashed. Only the incoming and completed are hashed due to row counts at
>> either end of the processing; in flight (where the issue shows up) is quite
>> small:
>>
>> [queuedb] # explain analyze select queueid,txobject,objectid,state from
>> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and
>> txobject = 'ticket' limit 250 for update skip locked;
>> ERROR: 40001: tuple to be locked was already moved to another partition due
>> to concurrent update
>> LOCATION: heapam_tuple_lock, heapam_handler.c:405
>> Time: 579.131 ms
>
> That is super curious. I hope that someone will jump in with an explanation
> or theory on this.
>
> I still wonder why the move between partitions is needed though if the work
> is either done (failed or successful) or not done... not started, retry
> needed or in progress... it doesn't matter. It needs to get picked up by the
> next process if it isn't already row locked.

I may be heading off in the wrong direction as I'm not fully sure I understand what the complaint is about, but isn't the executor just hitting dead rows in one of the active or failed partitions that have been moved off to some other partition?

When updates occur in a non-partitioned table, we can follow item pointer chains to find the live row and check if the WHERE clause still matches, to determine if the row should be updated - or in this case just locked, since it's a SELECT FOR UPDATE. However, with a partitioned table, a concurrent UPDATE may have caused the row to be moved off to another partition, in which case the tuple's item pointer cannot point to it, since we don't have enough address space; we only have 6 bytes for a TID. To get around the fact that we can't follow these update chains, we just throw the serialization error, which is what you're getting. Ideally, we'd figure out where the live version of the tuple is, check if it matches the WHERE clause and lock it if it does, but we've no means to do that with the current design.

If the complaint is about the fact that you're getting the error and you think you shouldn't be because you said "SKIP LOCKED", then I'm not really sure the fact that you said "SKIP LOCKED" gives us the right to ignore this case. The user only gave us the go-ahead to skip locked tuples, not to skip tuples that we just failed to follow item pointer chains for. It might be okay to do this for rows that have since been marked as complete, since they no longer match your WHERE clause; however, if the row has gone from the queue_tx_active partition into the queue_tx_fail_retryable partition, then I don't see why we'd have the right to skip the tuple. Your query says you want tuples that need to be retried. We can't go skipping them.

So isn't the fix just to code the application to retry on 40001 errors?

David
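Following up on that last point, here is a minimal sketch of retrying on 40001, wrapping the claim query from earlier in the thread in a PL/pgSQL function that catches serialization_failure (SQLSTATE 40001) and retries. The function name, the bigint type of queueid and the retry cap are assumptions; retrying the whole transaction from the application is the more usual way to handle it and works just as well.

CREATE OR REPLACE FUNCTION mq.claim_batch(p_limit int DEFAULT 250)
RETURNS SETOF bigint
LANGUAGE plpgsql AS
$$
DECLARE
    ids   bigint[];
    tries int := 0;
BEGIN
    LOOP
        -- Each BEGIN/EXCEPTION block is a subtransaction: a failed attempt
        -- releases any locks it took, while a successful attempt's locks are
        -- kept until the end of the calling transaction.
        BEGIN
            SELECT array_agg(t.queueid) INTO ids
            FROM (SELECT q.queueid
                  FROM mq.queue q
                  WHERE (q.state = 'tx_active' OR q.state = 'tx_fail_retryable')
                    AND q.txobject = 'ticket'
                  LIMIT p_limit
                  FOR UPDATE SKIP LOCKED) t;
            EXIT;   -- claimed a batch (possibly empty); stop retrying
        EXCEPTION WHEN serialization_failure THEN
            -- 40001: a tuple moved partition mid-scan; try again
            tries := tries + 1;
            IF tries > 10 THEN
                RAISE;  -- give up and let the caller see the error
            END IF;
        END;
    END LOOP;

    RETURN QUERY SELECT unnest(ids);
END;
$$;

-- The caller, still inside one transaction, processes the returned queueids
-- and updates or deletes them before COMMIT, as before.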
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On Fri, Aug 21, 2020 at 9:58 AM Jim Jarvie wrote:
> However, as I scale up the number of concurrent connections, I see a spike
> in CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED
> executes and the select processes wait for multiple minutes (10-20 minutes)
> before completing. My use case requires around 256 concurrent processors for
> the queue but I've been unable to scale beyond 128 without everything
> grinding to a halt.

Maybe it's just getting algorithmically ugly. To claim some job rows, you have to skip all the dead/non-matching tuples left behind so far at the start of the table by all the other sessions, and then also all currently locked tuples, and you have to do update-chain walks on some of them too. It all gets a bit explosive once you have such high numbers of workers.

I think I'd experiment with splitting the job table up into N tables and feeding jobs into all of them about evenly (by hashing, at random, whatever), and then I'd assign each consumer a "preferred" table where it looks for jobs first (perhaps my_worker_id % ntables), before trying the others in round-robin order. Then they won't trample on each other's toes so much.

In the past I've wondered about a hypothetical circular_seqscan option, which would cause table scans to start where they left off last time in each backend, so SELECT * FROM t LIMIT 1 repeated would show you a different row each time until you get all the way around to the start again (as we're entirely within our rights to do for a query with no ORDER BY). That'd give the system a chance to vacuum and start refilling the start of the table before you get around to it again, instead of repeatedly having to step over the same useless pages every time you need a new job. Combined with the N-tables thing, you'd be approaching a sweet spot for contention and dead tuple avoidance. The synchronized_seqscans setting is related to this idea, but more expensive, different, and probably not useful.

Hmm. I guess another way to avoid colliding with others' work would be to try to use

  SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR UPDATE SKIP LOCKED LIMIT ...

It's less cache-friendly, and less order-preserving, but way more contention-friendly. That has another complication though: how do you pick 10? And if it doesn't return any or enough rows, that doesn't mean there aren't enough, so you may need to be ready to fall back to the plain approach if having 250 rows is really important to you and TABLESAMPLE doesn't give you enough. Or something.

By the way, when working with around 64 consumer processes I was also annoyed by the thundering herd problem when using NOTIFY. I found various workaround solutions to that, but ultimately I think we need more precise wakeups for that sort of thing, which I hope to revisit one day.
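To make the TABLESAMPLE idea concrete, a small sketch using names from earlier in the thread, aimed at one of the plain (non-hashed) partitions directly, in the spirit of the N-plain-tables suggestion. The 10 percent figure is the arbitrary knob mentioned above, and a fall-back to the plain query would still be needed when the sample comes back short.

SELECT queueid, txobject, objectid, state
FROM mq.queue_tx_active TABLESAMPLE SYSTEM (10)  -- sample roughly 10% of the pages
WHERE txobject = 'ticket'                        -- state is implied by the partition
LIMIT 250
FOR UPDATE SKIP LOCKED;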
