nickva commented on issue #5127: URL: https://github.com/apache/couchdb/issues/5127#issuecomment-2253261222
Having failed to reproduce this locally, I moved on to investigating on a cluster where this error happens regularly. I found a cluster where `exit:timeout` stream init timeout errors happen up to 4000 times per minute. Most of them are not generated by an error in the coordinator or the workers. The processes that generate them are calls to `fabric:design_docs/1` from the ddoc cache recover logic. The calls do not appear to produce any failures except the left-over workers stuck in the stream_init state, waiting for stream start/cancel messages, which was rather baffling at first. After a more thorough investigation, the reason turned out to be that design docs are updated often enough that the ddoc cache quickly fires up and then immediately kills the `fabric:design_docs/1` process. There is no error to log, and since these are not gen_servers registered with SASL they don't emit any error logs, as expected.

In general, we already have a fabric_streams mechanism to handle the coordinator being killed unexpectedly. However, tracing the lifetime of the `fabric:design_docs/1` processes shows the coordinator is often killed before it gets a chance to even start the auxiliary cleanup process. The current pattern is something like this:

We submit the jobs: https://github.com/apache/couchdb/blob/d0cf54e1ef5c7e67fe08c29c5e80a1b77ca614e7/src/fabric/src/fabric_view_all_docs.erl#L28-L30

Then we spawn the cleanup process: https://github.com/apache/couchdb/blob/d0cf54e1ef5c7e67fe08c29c5e80a1b77ca614e7/src/fabric/src/fabric_streams.erl#L49-L51

Those two steps may seem like they happen almost back to back. However, tracing the `init_p` call on the worker side and logging the process info of the caller (the coordinator) shows that by the time `init_p` is called, the coordinator is often already dead. Since the cleaner process was never spawned, there is nothing left to clean up these workers.
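The racy window can be condensed into a sketch like this (function names here are illustrative, not the actual fabric/rexi API):

```erlang
%% Illustrative sketch of the current ordering. The coordinator first
%% casts the init_p jobs to the shard workers, and only afterwards
%% spawns the auxiliary cleanup process.
go(Shards) ->
    %% Step 1: workers start on the remote nodes; each opens a Db
    %% handle and blocks in a receive, waiting for start/cancel.
    Workers = submit_jobs(Shards),
    %% <-- If the coordinator is killed here (e.g. by the ddoc cache
    %%     recover logic), no cleaner exists yet and the workers are
    %%     orphaned in the stream_init state.
    %% Step 2: only now is the cleaner spawned.
    Cleaner = spawn_cleanup_process(Workers),
    stream_start(Workers, Cleaner).
```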
On the positive side, these workers don't actually do any work; they just wait in a receive clause, albeit with an open Db handle, which is not great. To fix this particular case we have to ensure the cleaner process starts even earlier: by the time the coordinator submits the jobs, the cleanup process should already be up and waiting, with the node-ref tuples ready, so it can clean the workers up.
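A minimal sketch of that reordering, assuming hypothetical helpers (`submit_jobs/1`, `cancel_worker/2`) rather than the actual fabric_streams API, and assuming the `{Node, Ref}` tuples are known before submission, as proposed above:

```erlang
%% Sketch of the proposed ordering: spawn the cleaner first, hand it
%% the {Node, Ref} tuples, and only then submit the jobs. If the
%% coordinator dies at any point after submission, the cleaner is
%% already watching and can cancel the workers.
start_with_cleaner(Workers) ->
    Coordinator = self(),
    Cleaner = spawn(fun() ->
        MRef = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', MRef, process, Coordinator, _Reason} ->
                %% Coordinator died: tell every worker to stop waiting.
                [cancel_worker(Node, Ref) || {Node, Ref} <- Workers]
        end
    end),
    %% Jobs are submitted only after the cleaner is up, closing the
    %% window in which workers could be orphaned.
    submit_jobs(Workers),
    {ok, Cleaner}.
```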
