chewbranca opened a new issue, #5122:
URL: https://github.com/apache/couchdb/issues/5122

   While investigating the following stacktrace, I noticed the fabric_rpc 
timeout was occurring during rexi stream initialization, yet the corresponding 
metric from `couch_stats:increment_counter([rexi, streams, timeout, 
init_stream])` was not triggering, and it turns out it'll never properly 
trigger.
   
   > mfa: fabric_rpc:all_docs/3 exit:timeout 
[{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,300}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,241}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,492}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,504}]},{couch_bt_engine,include_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1155}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,1138}]}]
   
   The fundamental problem is that there are two nearly identical rexi stream 
initialization functions except one of them does the proper metrics 
incrementing and the other function is actually utilized.
   
   ## The two functions are:
   
   ### 1) `rexi:stream_init/1` 
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L165-L179
   
   ```erlang
   stream_init(Timeout) ->
       case sync_reply(rexi_STREAM_INIT, Timeout) of
           rexi_STREAM_START ->
               ok;
           rexi_STREAM_CANCEL ->
               exit(normal);
           timeout ->
               couch_stats:increment_counter(
                   [rexi, streams, timeout, init_stream]
               ),
               exit(timeout);
           Else ->
               exit({invalid_stream_message, Else})
       end.
   ```
   
   and
   
   ### 2) `rexi:init_stream/1` 
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L292-L303
   
   ```erlang
   init_stream(Timeout) ->
       case sync_reply(rexi_STREAM_INIT, Timeout) of
           rexi_STREAM_START ->
               put(rexi_STREAM_INITED, true),
               ok;
           rexi_STREAM_CANCEL ->
               exit(normal);
           timeout ->
               exit(timeout);
           Else ->
               exit({invalid_stream_message, Else})
       end.
   ```
   
   As you can see these two functions are nearly identical except 
`rexi:stream_init` increments the counter on `timeout` and `rexi:init_stream` 
does a `put(rexi_STREAM_INITED, true),` to track stream init status. However, 
`rexi:stream_init` is never actually called:
   
   ```
   chewbranca@Russells-MacBook-Pro-2 db % grep -r rexi:stream_init src
   chewbranca@Russells-MacBook-Pro-2 db % 
   ```
   
   and instead the internal function `rexi:init_stream` is utilized by way of 
`rexi:maybe_init_stream` in `rexi:stream2` and `rexi:stream_last`.
   
   The result is that we never properly increment the metric indicating 
stream_init timeouts.
   
   I think the idea was that `rexi:stream2` has a try-catch block around 
`throw:timeout` in 
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L248-L252 and 
then that'll take care of incrementing the counter, but `maybe_init_stream` is 
called _outside_ of that try-catch block 
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L241 so it 
would never be caught, however, `maybe_init_stream` doesn't _throw_ a timeout, 
it does a process exit with `Reason=timeout` 
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L300 so 
that'll never be caught by a try-catch.
   
   It seems possible we can fix the underlying lack of metrics by way 
consolidating `stream_init` and `init_stream` down to a singular function that 
both sets the pdict value and increments the counter, but that might result in 
a double counting if the old `rexi:stream` is used instead of `rexi:stream2`, 
and it doesn't fix the underlying issue that there's various flow control 
elements expecting to catch on `throw(timeout)` which is not possible with 
`exit(timeout)`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to