chewbranca opened a new issue, #5127: URL: https://github.com/apache/couchdb/issues/5127
While trying to understand why we'd encounter `rexi:init_stream` errors in https://github.com/apache/couchdb/issues/5122, I believe I've identified a pattern present in at least four of the fabric RPC related modules. I think `fabric_view_all_docs.erl` is a relatively straightforward representation of the issue, so I'm going to dissect the flow from there.

### Step 1) Instantiate RPC workers

We first create a set of RPC workers on the remote nodes as specified in `Shards`. This creates the handle `Workers0` with a set of references to all instantiated RPC workers.

```erlang
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
```

### Step 2) Create a set of monitors for all remote nodes

```erlang
RexiMon = fabric_util:create_monitors(Workers0),
```

This creates a set of monitors on the relevant remote rexi processes for each of the nodes in question, _not_ on the workers themselves:

```erlang
create_monitors(Shards) ->
    MonRefs = lists:usort([rexi_utils:server_pid(N) || #shard{node = N} <- Shards]),
    rexi_monitor:start(MonRefs).
```

### Step 3) Handle `fabric_streams:start` in a `try ... after ... end` block

```erlang
try
    case fabric_streams:start(Workers0, #shard.ref, RingOpts) of
        ...
    end
after
    rexi_monitor:stop(RexiMon)
end
```

This invokes `fabric_streams:start` in a `try` block so that, in the `after` clause, we invoke `rexi_monitor:stop(RexiMon)` to clear out the monitors.

### Step 4) Handle the inner case clauses of Step 3)

First off, we have the successful case, once the stream has been initialized:

```erlang
{ok, Workers} ->
    try
        go(DbName, Options, Workers, CoordArgs, Callback, Acc)
    after
        fabric_streams:cleanup(Workers)
    end;
```

The key thing to note here is that this clause performs `fabric_streams:cleanup(Workers)` in the `after` clause of a `try` block to ensure the remote workers are cleaned up after the job is done.
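Putting Steps 1–4 together, the overall shape of the coordinator looks roughly like this. This is a condensed paraphrase assembled from the snippets above, not verbatim CouchDB source:

```erlang
%% Condensed paraphrase of the fabric_view_all_docs flow (illustrative only)

%% Step 1: spawn one RPC worker per shard copy across the cluster
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
%% Step 2: monitor the per-node rexi servers, not the workers themselves
RexiMon = fabric_util:create_monitors(Workers0),
try
    %% Step 3: wait for enough workers to cover the full shard ring
    case fabric_streams:start(Workers0, #shard.ref, RingOpts) of
        {ok, Workers} ->
            %% Step 4, success: note that only the *selected* subset
            %% `Workers` is cleaned up, never the full `Workers0`
            try
                go(DbName, Options, Workers, CoordArgs, Callback, Acc)
            after
                fabric_streams:cleanup(Workers)
            end;
        {timeout, _NewState} ->
            %% failure: error bubbles up, no worker cleanup
            ...;
        {error, _Error} ->
            %% failure: error bubbles up, no worker cleanup
            ...
    end
after
    %% stops the dedicated monitoring process only; does not touch workers
    rexi_monitor:stop(RexiMon)
end
```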
However, the cleanup is performed against the subset of workers selected to perform the job in `Workers`, not the original full set of RPC workers instantiated and stored in `Workers0`.

Next we have the two failure cases for this fabric operation. I'll lump them together as their behavior is essentially identical:

```erlang
{timeout, NewState} ->
    DefunctWorkers = fabric_util:remove_done_workers(
        NewState#stream_acc.workers, waiting
    ),
    fabric_util:log_timeout(
        DefunctWorkers, "all_docs"
    ),
    Callback({error, timeout}, Acc);
{error, Error} ->
    Callback({error, Error}, Acc)
```

Both of these failure clauses bubble up the error through the caller provided `Callback`; however, neither performs any cleanup of the workers. In the outer `after` clause we do a `rexi_monitor:stop(RexiMon)`, but that only kills the dedicated monitoring process — as far as the workers are concerned it's a no-op.

## Core Issue

I think there are two things going on here that we need to address, plus a bonus item:

1) RPC workers are not cleaned up at all in the `fabric_streams:start` error modes

   This one is fairly straightforward: we should always ensure workers are cleaned up, especially when failures happen. Basically, I think we should do a `fabric_streams:cleanup` on the workers in the outer `after` clause so they're always cleaned up.

2) when we do call `fabric_streams:cleanup(Workers)`, it's on `Workers` instead of `Workers0`

   This might be a bit more controversial, but I suspect one of the ways in which https://github.com/apache/couchdb/issues/5122 manifests is because we're not diligent about canceling RPC workers. It's possible that `fabric_streams:cleanup(Workers)` is sufficient, but I think `fabric_streams:cleanup(Workers0)` against the full original set of workers is appropriate.

3) bonus item: we should consider moving the cleanup logic to the rexi_mon monitor

   The core rationale here is that `after` clauses do not trigger when a process is killed, leaving the possibility of remote zombied RPC workers.
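To make items 1) and 2) concrete, one possible restructuring is to move the cleanup into the outer `after` clause and run it against the full original worker set. This is a hedged sketch paraphrasing the snippets above, not a tested patch:

```erlang
%% Sketch: cleanup of *all* spawned workers on every exit path
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
RexiMon = fabric_util:create_monitors(Workers0),
try
    case fabric_streams:start(Workers0, #shard.ref, RingOpts) of
        {ok, Workers} ->
            go(DbName, Options, Workers, CoordArgs, Callback, Acc);
        {timeout, NewState} ->
            DefunctWorkers = fabric_util:remove_done_workers(
                NewState#stream_acc.workers, waiting
            ),
            fabric_util:log_timeout(DefunctWorkers, "all_docs"),
            Callback({error, timeout}, Acc);
        {error, Error} ->
            Callback({error, Error}, Acc)
    end
after
    %% run in all exit paths (ok, timeout, error, exceptions),
    %% and against Workers0 rather than the selected subset
    fabric_streams:cleanup(Workers0),
    rexi_monitor:stop(RexiMon)
end
```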
In theory the remote nodes' `rexi_server` processes should get a process down notification, and perhaps that's sufficient, but I'm personally inclined to do double bookkeeping in these types of scenarios: monitor from the RPC side _and_ send out a kill signal from the coordinator side. What do folks think?

## Presence in the codebase

Right now I've identified this pattern in the four following fabric modules, although I've not done a full audit of the other modules, so there may be more instances of this:

* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_all_docs.erl#L28-L54
* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_map.erl#L45-L84
* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_reduce.erl#L36-L73
* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_changes.erl#L173-L219
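To illustrate the double-bookkeeping idea from bonus item 3): since `after` clauses never run when the coordinator process is killed outright, an auxiliary process that monitors the coordinator could issue the kill signals on its behalf. The sketch below is hypothetical — the `spawn_worker_cleaner` name and its shape are made up for illustration, and it assumes a `rexi:kill(Node, Ref)`-style cast is available for canceling a remote worker:

```erlang
%% Hypothetical cleaner process: spawned alongside the coordinator,
%% it cancels every remote worker if the coordinator dies for any
%% reason, including a brutal kill that skips `after` clauses.
spawn_worker_cleaner(Coordinator, Workers) ->
    spawn(fun() ->
        Ref = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', Ref, process, Coordinator, _Reason} ->
                %% coordinator is gone: cancel each remote worker
                [rexi:kill(Node, WRef)
                    || #shard{node = Node, ref = WRef} <- Workers]
        end
    end).
```

This keeps the coordinator-side kill signal from the sketches above while also covering the killed-coordinator case, at the cost of one extra short-lived process per request.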
