chewbranca opened a new issue, #5127: URL: https://github.com/apache/couchdb/issues/5127
While trying to understand why we'd encounter `rexi:init_stream` errors in https://github.com/apache/couchdb/issues/5122, I believe I've identified a pattern present in at least four of the fabric RPC related modules. I think `fabric_view_all_docs.erl` is a relatively straightforward representation of the issue, so I'm going to dissect the flow from there.

### Step 1) Instantiate RPC workers

We first create a set of RPC workers on the remote nodes as specified in `Shards`. This creates the handle `Workers0` with a set of references to all instantiated RPC workers.

```erlang
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
```

### Step 2) Create a set of monitors for all remote nodes

```erlang
RexiMon = fabric_util:create_monitors(Workers0),
```

This creates a set of monitors on the relevant remote rexi processes for each of the nodes in question, _not_ on the workers themselves:

```erlang
create_monitors(Shards) ->
    MonRefs = lists:usort([rexi_utils:server_pid(N) || #shard{node = N} <- Shards]),
    rexi_monitor:start(MonRefs).
```

### Step 3) Handle `fabric_streams:start` in a `try ... after ... end` block

```erlang
try
    case fabric_streams:start(Workers0, #shard.ref, RingOpts) of
        ...
    end
after
    rexi_monitor:stop(RexiMon)
end
```

This invokes `fabric_streams:start` in a `try` block so that, in the `after` clause, we invoke `rexi_monitor:stop(RexiMon)` to clear out the monitors.

### Step 4) Handle the inner case clauses of Step 3)

First off, we have the successful case, once the stream has been initialized:

```erlang
{ok, Workers} ->
    try
        go(DbName, Options, Workers, CoordArgs, Callback, Acc)
    after
        fabric_streams:cleanup(Workers)
    end;
```

The key thing to note here is that this clause performs `fabric_streams:cleanup(Workers)` in the `after` clause of a `try` block to ensure the remote workers are cleaned up after the job is done.
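Putting Steps 1–4 together, the overall shape of the coordinator looks roughly like this. This is a condensed paraphrase assembled from the snippets above, not verbatim CouchDB source:

```erlang
%% Condensed paraphrase of the fabric_view_all_docs flow (illustrative only)

%% Step 1: spawn one RPC worker per shard copy across the cluster
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
%% Step 2: monitor the per-node rexi servers, not the workers themselves
RexiMon = fabric_util:create_monitors(Workers0),
try
    %% Step 3: wait for enough workers to cover the full shard ring
    case fabric_streams:start(Workers0, #shard.ref, RingOpts) of
        {ok, Workers} ->
            %% Step 4, success: note that only the *selected* subset
            %% `Workers` is cleaned up, never the full `Workers0`
            try
                go(DbName, Options, Workers, CoordArgs, Callback, Acc)
            after
                fabric_streams:cleanup(Workers)
            end;
        {timeout, _NewState} ->
            %% failure: error bubbles up, no worker cleanup
            ...;
        {error, _Error} ->
            %% failure: error bubbles up, no worker cleanup
            ...
    end
after
    %% stops the dedicated monitoring process only; does not touch workers
    rexi_monitor:stop(RexiMon)
end
```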
However, the cleanup is performed against the subset of workers selected to perform the job in `Workers`, not the original full set of RPC workers instantiated and stored in `Workers0`.

Next we have the two failure cases for this fabric operation. I'll lump them together as their behavior is essentially identical:

```erlang
{timeout, NewState} ->
    DefunctWorkers = fabric_util:remove_done_workers(
        NewState#stream_acc.workers, waiting
    ),
    fabric_util:log_timeout(
        DefunctWorkers, "all_docs"
    ),
    Callback({error, timeout}, Acc);
{error, Error} ->
    Callback({error, Error}, Acc)
```

Both of these failure clauses bubble up the error through the caller provided `Callback`; however, neither performs any cleanup of the workers. In the outer `after` clause we do a `rexi_monitor:stop(RexiMon)`, but that only kills the dedicated monitoring process — as far as the workers are concerned it's a no-op.

## Core Issue

I think there are two things going on here that we need to address, plus a bonus item:

1) RPC workers are not cleaned up at all in the `fabric_streams:start` error modes

   This one is fairly straightforward: we should always ensure workers are cleaned up, especially when failures happen. Basically, I think we should do a `fabric_streams:cleanup` on the workers in the outer `after` clause so they're always cleaned up.

2) when we do call `fabric_streams:cleanup(Workers)`, it's on `Workers` instead of `Workers0`

   This might be a bit more controversial, but I suspect one of the ways in which https://github.com/apache/couchdb/issues/5122 manifests is because we're not diligent about canceling RPC workers. It's possible that `fabric_streams:cleanup(Workers)` is sufficient, but I think `fabric_streams:cleanup(Workers0)` against the full original set of workers is appropriate.

3) bonus item: we should consider moving the cleanup logic to the rexi_mon monitor

   The core rationale here is that `after` clauses do not trigger when a process is killed, leaving the possibility of remote zombied RPC workers.
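To make items 1) and 2) concrete, one possible restructuring is to move the cleanup into the outer `after` clause and run it against the full original worker set. This is a hedged sketch paraphrasing the snippets above, not a tested patch:

```erlang
%% Sketch: cleanup of *all* spawned workers on every exit path
Workers0 = fabric_util:submit_jobs(
    Shards, fabric_rpc, all_docs, [Options, WorkerArgs]
),
RexiMon = fabric_util:create_monitors(Workers0),
try
    case fabric_streams:start(Workers0, #shard.ref, RingOpts) of
        {ok, Workers} ->
            go(DbName, Options, Workers, CoordArgs, Callback, Acc);
        {timeout, NewState} ->
            DefunctWorkers = fabric_util:remove_done_workers(
                NewState#stream_acc.workers, waiting
            ),
            fabric_util:log_timeout(DefunctWorkers, "all_docs"),
            Callback({error, timeout}, Acc);
        {error, Error} ->
            Callback({error, Error}, Acc)
    end
after
    %% run in all exit paths (ok, timeout, error, exceptions),
    %% and against Workers0 rather than the selected subset
    fabric_streams:cleanup(Workers0),
    rexi_monitor:stop(RexiMon)
end
```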
In theory the remote nodes' `rexi_server` processes should get a process down notification, and perhaps that's sufficient, but I'm personally inclined to do double bookkeeping in these types of scenarios: monitor from the RPC side _and_ send out a kill signal from the coordinator side. What do folks think?

## Presence in the codebase

Right now I've identified this pattern in the four following fabric modules, although I've not done a full audit of the other modules, so there may be more instances of this:

* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_all_docs.erl#L28-L54
* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_map.erl#L45-L84
* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_reduce.erl#L36-L73
* https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_view_changes.erl#L173-L219
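To illustrate the double-bookkeeping idea from bonus item 3): since `after` clauses never run when the coordinator process is killed outright, an auxiliary process that monitors the coordinator could issue the kill signals on its behalf. The sketch below is hypothetical — the `spawn_worker_cleaner` name and its shape are made up for illustration, and it assumes a `rexi:kill(Node, Ref)`-style cast is available for canceling a remote worker:

```erlang
%% Hypothetical cleaner process: spawned alongside the coordinator,
%% it cancels every remote worker if the coordinator dies for any
%% reason, including a brutal kill that skips `after` clauses.
spawn_worker_cleaner(Coordinator, Workers) ->
    spawn(fun() ->
        Ref = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', Ref, process, Coordinator, _Reason} ->
                %% coordinator is gone: cancel each remote worker
                [rexi:kill(Node, WRef)
                    || #shard{node = Node, ref = WRef} <- Workers]
        end
    end).
```

This keeps the coordinator-side kill signal from the sketches above while also covering the killed-coordinator case, at the cost of one extra short-lived process per request.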
