nickva commented on issue #5127:
URL: https://github.com/apache/couchdb/issues/5127#issuecomment-2253261222

   Having failed to reproduce this locally, I moved on to investigating a 
cluster where this error happens regularly.
   
   Found a cluster where `exit:timeout` stream init timeout errors happen up to 
4000 times per minute. Noticed that most of them are not generated by an error in 
the coordinator or the workers. The processes that generate them are calls to 
`fabric:design_docs/1` from the ddoc cache recover logic. The calls do not seem to 
produce any failures apart from the left-over workers stuck in the stream_init 
state, waiting for stream start/cancel messages, which was rather baffling at first.
   
   However, after a more thorough investigation, the reason turned out to be that 
design docs are updated often enough that the ddoc cache quickly fires up 
and then immediately kills the `fabric:design_docs/1` process. There is no error to 
log, and since these are not gen_servers registered with SASL, they 
don't emit any error reports, as expected.
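   For illustration, here is a minimal shell snippet (not from the CouchDB code) 
showing that a plain Erlang process killed from the outside dies silently: only 
linked or monitoring peers observe the exit, and no SASL/logger crash report is 
emitted:
   
   ```erlang
   %% A plain process parked in a receive, like the stuck workers.
   {Pid, Ref} = spawn_monitor(fun() -> receive after infinity -> ok end end),
   %% Brutal kill, analogous to what the ddoc cache refresh does.
   exit(Pid, kill),
   %% The monitor sees the exit reason, but nothing is logged anywhere.
   receive {'DOWN', Ref, process, Pid, killed} -> ok end.
   ```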
   
   In general, we already have a fabric_streams mechanism to handle the 
coordinator being killed unexpectedly. However, tracing the lifetime of the 
`fabric:design_docs/1` processes shows that the coordinator is often killed before 
it even gets a chance to start the auxiliary cleanup process. The current pattern 
is something like this:
   
   We submit the jobs:
   
   
https://github.com/apache/couchdb/blob/d0cf54e1ef5c7e67fe08c29c5e80a1b77ca614e7/src/fabric/src/fabric_view_all_docs.erl#L28-L30
   
   Then we spawn the cleanup process:
   
   
https://github.com/apache/couchdb/blob/d0cf54e1ef5c7e67fe08c29c5e80a1b77ca614e7/src/fabric/src/fabric_streams.erl#L49-L51
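   
   Condensed, the current ordering looks roughly like this (a paraphrased sketch, 
not the actual code; only the `fabric_util:submit_jobs` call and the cleaner spawn 
correspond to the lines linked above, and `cleanup/1` here is a made-up placeholder):
   
   ```erlang
   %% 1. Workers are spawned on the remote nodes first ...
   Workers = fabric_util:submit_jobs(Shards, fabric_rpc, all_docs, [Options, Args]),
   %% 2. ... and only afterwards does fabric_streams spawn the cleaner.
   %% If the coordinator is killed between these two steps, the workers
   %% are left parked in stream_init with nothing to cancel them.
   spawn(fun() -> cleanup(Workers) end).
   ```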
   
   Those two steps may seem to happen almost immediately one after the other. 
However, tracing the `init_p` call on the worker side and logging the process info 
of the caller (the coordinator) shows that by the time `init_p` runs, the 
coordinator is often already dead. Since the cleaner process was never spawned, 
there is nothing to clean up these workers.
   
   On the positive side, these workers don't actually do any work; they just 
wait in a receive clause, albeit with an open Db handle, which is not too 
great.
   
   To fix this particular case we have to ensure the cleaner process starts 
even earlier: by the time the coordinator submits the jobs, the cleanup process 
should already be up and waiting, with the node-ref tuples ready to clean up the 
workers.
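   
   A hypothetical sketch of that ordering (helper name `spawn_cleaner/2` is made 
up, not the real fabric_streams API; the submit call is paraphrased from the lines 
linked above):
   
   ```erlang
   %% 1. Start the cleaner first, primed with the {Node, Ref} tuples
   %%    it will need in order to cancel the workers.
   Cleaner = spawn_cleaner(self(), NodeRefs),
   %% 2. Only then submit the jobs. Even if the coordinator is killed
   %%    right after this call, the cleaner already knows about the
   %%    workers and can shut them down.
   Workers = fabric_util:submit_jobs(Shards, fabric_rpc, all_docs, [Options, Args]).
   ```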
   

