nickva commented on issue #5127: URL: https://github.com/apache/couchdb/issues/5127#issuecomment-2253261222
Having failed to reproduce this locally, I moved on to investigating on a cluster where this error happens regularly. I found a cluster where `exit:timeout` stream init timeout errors happen up to 4000 times per minute. Most of them are not generated by an error in the coordinator or the workers. The processes that generate them are calls to `fabric:design_docs/1` from the ddoc cache recover logic. The calls do not appear to produce any failures except the left-over workers stuck in the stream_init state, waiting for stream start/cancel messages, which was rather baffling at first. After a more thorough investigation, the reason turned out to be that design docs are updated often enough that the ddoc cache quickly fires up and then immediately kills the `fabric:design_docs/1` process. There is no error to log, and since these are not gen_servers registered with SASL they don't emit any error logs, as expected.

In general, we already have a fabric_streams mechanism to handle the coordinator being killed unexpectedly. However, tracing the lifetime of the `fabric:design_docs/1` processes shows the coordinator is often killed before it gets a chance to even start the auxiliary cleanup process. The current pattern is something like this:

We submit the jobs: https://github.com/apache/couchdb/blob/d0cf54e1ef5c7e67fe08c29c5e80a1b77ca614e7/src/fabric/src/fabric_view_all_docs.erl#L28-L30

Then we spawn the cleanup process: https://github.com/apache/couchdb/blob/d0cf54e1ef5c7e67fe08c29c5e80a1b77ca614e7/src/fabric/src/fabric_streams.erl#L49-L51

Those two steps may seem like they happen almost back to back. However, tracing the `init_p` call on the worker side and logging the process info of the caller (the coordinator) shows that by the time `init_p` is called, the coordinator is often already dead. Since the cleaner process was never spawned, there is nothing left to clean up these workers.
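The racy window can be condensed into a sketch like this (function names here are illustrative, not the actual fabric/rexi API):

```erlang
%% Illustrative sketch of the current ordering. The coordinator first
%% casts the init_p jobs to the shard workers, and only afterwards
%% spawns the auxiliary cleanup process.
go(Shards) ->
    %% Step 1: workers start on the remote nodes; each opens a Db
    %% handle and blocks in a receive, waiting for start/cancel.
    Workers = submit_jobs(Shards),
    %% <-- If the coordinator is killed here (e.g. by the ddoc cache
    %%     recover logic), no cleaner exists yet and the workers are
    %%     orphaned in the stream_init state.
    %% Step 2: only now is the cleaner spawned.
    Cleaner = spawn_cleanup_process(Workers),
    stream_start(Workers, Cleaner).
```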
On the positive side, these workers don't actually do any work; they just wait in a receive clause, albeit with an open Db handle, which is not great. To fix this particular case we have to ensure the cleaner process starts even earlier: by the time the coordinator submits the jobs, the cleanup process should already be up and waiting, with the node-ref tuples ready, so it can clean the workers up.
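A minimal sketch of that reordering, assuming hypothetical helpers (`submit_jobs/1`, `cancel_worker/2`) rather than the actual fabric_streams API, and assuming the `{Node, Ref}` tuples are known before submission, as proposed above:

```erlang
%% Sketch of the proposed ordering: spawn the cleaner first, hand it
%% the {Node, Ref} tuples, and only then submit the jobs. If the
%% coordinator dies at any point after submission, the cleaner is
%% already watching and can cancel the workers.
start_with_cleaner(Workers) ->
    Coordinator = self(),
    Cleaner = spawn(fun() ->
        MRef = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', MRef, process, Coordinator, _Reason} ->
                %% Coordinator died: tell every worker to stop waiting.
                [cancel_worker(Node, Ref) || {Node, Ref} <- Workers]
        end
    end),
    %% Jobs are submitted only after the cleaner is up, closing the
    %% window in which workers could be orphaned.
    submit_jobs(Workers),
    {ok, Cleaner}.
```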
