sergey-safarov commented on issue #4790:
URL: https://github.com/apache/couchdb/issues/4790#issuecomment-2380707745

   We have caught the same issue on v3.3.3.
   Also, on one CouchDB node, I can see "Node not responding":
   ```
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.29864.2> -------- ** Node 
'[email protected]' not responding **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) 
connection **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.29864.2> -------- ** Node 
'[email protected]' not responding **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) 
connection **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.20680.5805> -------- 1 conflicted shard in 
cluster
   Sep 28 01:47:08 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.4340.5788> -------- 1 conflicted shard in cluster
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.19100.5806> -------- fabric_worker_timeout 
get_all_security,'[email protected]',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.19100.5806> -------- fabric_worker_timeout 
get_all_security,'[email protected]',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.19100.5806> -------- Error checking security 
objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.11649.5811> -------- fabric_worker_timeout 
update_docs,'[email protected]',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.11649.5811> -------- fabric_worker_timeout 
update_docs,'[email protected]',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.30041.5810> -------- fabric_worker_timeout 
get_all_security,'[email protected]',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.30041.5810> -------- Error checking security 
objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.7850.5798> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.7850.5798> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.15964.5814> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.15964.5814> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.32403.5807> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.32403.5807> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.10430.5762> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.10430.5762> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.11019.5802> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.11019.5802> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.28862.5796> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.28862.5796> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: 
[email protected] <0.14078.5795> -------- fabric_worker_timeout 
open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   ```
   
   But during troubleshooting, while the issue was present, I sent a
   `/_membership` curl request and it returned a response with all three nodes
   online in the cluster. The request was sent to each CouchDB node in the
   cluster, and each returned the same result: "three nodes online in the cluster".
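
For anyone repeating this check, here is a minimal sketch of the per-node `/_membership` comparison. It assumes the response carries the `all_nodes` and `cluster_nodes` arrays and that, as I understand it, `all_nodes` reflects the nodes currently connected while `cluster_nodes` lists the configured members. The host names and the helper names (`offline_nodes`, `fetch_membership`) are hypothetical, and authentication is omitted even though a real cluster will most likely require admin credentials.

```python
import json
import urllib.request


def offline_nodes(membership: dict) -> list:
    """Return nodes listed in cluster_nodes that are absent from all_nodes."""
    connected = set(membership.get("all_nodes", []))
    expected = membership.get("cluster_nodes", [])
    return sorted(node for node in expected if node not in connected)


def fetch_membership(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/_membership and parse the JSON body (no auth shown)."""
    with urllib.request.urlopen(f"{base_url}/_membership", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical node URLs; replace with the real cluster members.
    node_urls = [
        "http://db0a.example.invalid:5984",
        "http://db0b.example.invalid:5984",
        "http://db1a.example.invalid:5984",
    ]
    for url in node_urls:
        try:
            missing = offline_nodes(fetch_membership(url))
        except OSError as exc:
            # URLError and socket timeouts are both subclasses of OSError.
            print(f"{url}: request failed ({exc})")
            continue
        print(f"{url}: missing nodes = {missing or 'none'}")
```

An empty result on every node only shows that each node considers its peers connected at that moment; it does not rule out slow or overloaded workers behind the `fabric_worker_timeout` messages.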
   
   On the other two nodes in the cluster, I can see error messages like 
"fabric_worker_timeout open_doc" and no messages like "Node not responding".
   
   Also, on the two nodes, CPU load increased to 100%.
   **db0a**
   
![image](https://github.com/user-attachments/assets/4c285c4e-75ee-45c4-8571-fe6756dd621e)
   **db0b**
   
![image](https://github.com/user-attachments/assets/8d7ea6e1-0219-4b66-96bb-487640c5ba9c)
   **db1a**
   
![image](https://github.com/user-attachments/assets/7643ac29-2fe9-4e32-bb06-dcbf7cd6641f)
   
   I am sure network connectivity is present between the CouchDB nodes. Also,
   the `/_membership` response showed all nodes online on all CouchDB
   instances.
   But anyway, we will apply the recommended values and provide feedback if the
   issue is reproduced.
   ```
   [cluster]
   reconnect_interval_sec = 37

   [fabric]
   request_timeout = 60000
   ```
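
If editing the ini file on every node is inconvenient, the same two values could probably be applied through the node-local configuration endpoint instead. The sketch below only builds the `PUT /_node/_local/_config/{section}/{key}` requests and does not send them. The helper name `build_config_put` and the base URL are hypothetical, and admin credentials, which this endpoint requires, are left out.

```python
import json
import urllib.request


def build_config_put(base_url: str, section: str, key: str, value) -> urllib.request.Request:
    """Build a PUT request that sets one config key on the node behind base_url."""
    url = f"{base_url}/_node/_local/_config/{section}/{key}"
    # The config API expects the new value as a JSON string in the body.
    body = json.dumps(str(value)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Values taken from the settings quoted above; the URL is a placeholder.
    settings = [
        ("cluster", "reconnect_interval_sec", 37),
        ("fabric", "request_timeout", 60000),
    ]
    for section, key, value in settings:
        req = build_config_put("http://db0a.example.invalid:5984", section, key, value)
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Each request would need to be sent to every node separately, since the configuration is per node.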
   

