dlmarion commented on PR #5321:
URL: https://github.com/apache/accumulo/pull/5321#issuecomment-2666485492

   > Looks good, did you manually test this? Wondering what the output would look like and if there would be an indication that zoozap ran. That led to the comment about debugOrRun.
   
   I applied your suggestion and made some changes to ZooZap in 18844f8 so that it cleans up ZooKeeper consistently for the different server types. ZooZap was doing one thing for compactors and something else for tservers and sservers; I changed it so that in all cases it does a recursive delete on the resource group.
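
A rough illustration of what "a recursive delete on the resource group" has to do: ZooKeeper refuses to delete a znode that still has children, so the delete must walk the tree and remove children before their parent (ZooKeeper's own `ZKUtil.deleteRecursive` does this kind of walk). This is not ZooZap's code. `RecursiveDeleteSketch` is a hypothetical in-memory stand-in for a ZooKeeper namespace, and the paths and host names in `main` are made up. Treating a missing path as already clean keeps the operation harmless when a server has already removed its own entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/**
 * Hypothetical in-memory stand-in for a ZooKeeper namespace, for illustration only.
 * Like ZooKeeper, it refuses to delete a node that still has children. It does not
 * model node data, versions, or parent-existence checks on create.
 */
final class RecursiveDeleteSketch {

  private final TreeSet<String> nodes = new TreeSet<>();
  private final List<String> deleted = new ArrayList<>();

  void create(String path) {
    nodes.add(path);
  }

  boolean exists(String path) {
    return nodes.contains(path);
  }

  /** Direct children of path, in sorted order. */
  List<String> children(String path) {
    String prefix = path + "/";
    List<String> result = new ArrayList<>();
    for (String n : nodes) {
      // A direct child starts with the prefix and has no further '/' after it.
      if (n.startsWith(prefix) && n.indexOf('/', prefix.length()) < 0) {
        result.add(n);
      }
    }
    return result;
  }

  /** Single-node delete; fails on a missing or non-empty node, as ZooKeeper's delete does. */
  void delete(String path) {
    if (!exists(path)) {
      throw new IllegalStateException("no node: " + path);
    }
    if (!children(path).isEmpty()) {
      throw new IllegalStateException("node not empty: " + path);
    }
    nodes.remove(path);
    deleted.add(path);
  }

  /** Post-order delete: children first, then the node. A missing path is a no-op. */
  void deleteRecursive(String path) {
    if (!exists(path)) {
      return;
    }
    for (String child : children(path)) {
      deleteRecursive(child);
    }
    delete(path);
  }

  List<String> deletedInOrder() {
    return List.copyOf(deleted);
  }

  public static void main(String[] args) {
    // "instance-id" is a placeholder for the real instance UUID.
    String group = "/accumulo/instance-id/compactors/default";
    RecursiveDeleteSketch zk = new RecursiveDeleteSketch();
    zk.create("/accumulo/instance-id/compactors");
    zk.create(group);
    zk.create(group + "/host1:9999");           // hypothetical server entry
    zk.create(group + "/host1:9999/lock-0001"); // hypothetical lock node
    zk.create(group + "/host2:9999");
    zk.deleteRecursive(group);
    zk.deleteRecursive("/accumulo/instance-id/tservers/default"); // already gone: no-op
    zk.deletedInOrder().forEach(System.out::println);
  }
}
```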
   
   Below is the output of `accumulo-cluster stop` on my local node, with the 
log4j lines removed.
   
   ```
   Stopping Accumulo cluster...
   Accumulo shut down cleanly
   Utilities and unresponsive servers will shut down in 5 seconds (Ctrl-C to abort)
   Executing stop on tablet servers for group default ....Cleaning tablet server entries from zookeeper for resource group default
   Stopping service process: tserver_default_1
   Stopping tserver_default_1 on 1.2.3.4
   done
   Executing stop on managers
   Executing stop on garbage collectors
   Executing stop on monitors
   Executing stop on scan servers for group default
   Stopping service process: manager_default_1
   Stopping manager_default_1 on 1.2.3.4
   Stopping service process: gc_default_1
   Stopping gc_default_1 on 1.2.3.4
   Cleaning scan server entries from zookeeper for resource group default
   Stopping service process: monitor_default_1
   Stopping monitor_default_1 on 1.2.3.4
   Stopping service process: sserver_default_1
   Stopping sserver_default_1 on 1.2.3.4
   Executing stop on compactors for group default
   Cleaning compactor entries from zookeeper for resource group default
   Stopping service process: compactor_default_1
   Stopping compactor_default_1 on 1.2.3.4
   Deleting compactor /accumulo/063fa7a3-0eda-409e-89b8-6d559d6269cf/compactors/default from zookeeper
   ```
   I think it shows a couple of places where the ZooKeeper cleanup is inconsistent:
   
     1. In the case of the compactor and sserver, you see a message "Deleting ..... from zookeeper". This comes from ZooZap, but you don't see it for the tserver. The tserver log shows that a stop was requested, the tserver performed a graceful shutdown, and it removed its own lock from ZooKeeper. I think the tserver address was empty when ZooZap was executed, so ZooZap did not remove the tserver path for resource group default. That path may have been cleaned up by the `admin stopAll` code path, but I'm not entirely sure at the moment.
     2. For the compactor case, you see the following:
     ```
     Executing stop on compactors for group default
     Cleaning compactor entries from zookeeper for resource group default
     Stopping service process: compactor_default_1
     Stopping compactor_default_1 on 1.2.3.4
     Deleting compactor /accumulo/063fa7a3-0eda-409e-89b8-6d559d6269cf/compactors/default from zookeeper
     ```
     The two lines about removing entries from ZooKeeper come from the ZooZap command, which runs after the code in `accumulo-cluster` that stops the compactor process. In my setup I'm using a remote address for the processes, so the compactor is stopped by a command run over ssh. In this case I think it's possible that the ZooKeeper paths are removed before the processes have actually stopped.
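
One way to avoid the ordering problem in item 2 is to treat the remote stops as a barrier: launch them, wait until every one has returned, and only then run the ZooKeeper cleanup. The Java sketch below only illustrates that ordering and is not how `accumulo-cluster` works; since that is a shell script, the analogous step there would presumably be waiting on any backgrounded ssh commands (for example with bash's `wait`) before invoking ZooZap. `StopThenCleanSketch`, `stopRemote`, and `zap` are hypothetical placeholders that only record events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: stop every process in a group, wait for all stops, then clean up. */
final class StopThenCleanSketch {

  /** Placeholder for "ssh <host> <stop command>"; the sleep mimics a slow remote stop. */
  static void stopRemote(String service, List<String> events) {
    try {
      TimeUnit.MILLISECONDS.sleep(50);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    events.add("stopped " + service);
  }

  /** Placeholder for running ZooZap against one resource group. */
  static void zap(String group, List<String> events) {
    events.add("zapped " + group);
  }

  static List<String> stopGroupThenZap(String group, List<String> services) {
    List<String> events = Collections.synchronizedList(new ArrayList<>());
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, services.size()));
    try {
      // Launch all stops concurrently, as a script might with backgrounded ssh commands.
      CompletableFuture<?>[] stops = services.stream()
          .map(s -> CompletableFuture.runAsync(() -> stopRemote(s, events), pool))
          .toArray(CompletableFuture[]::new);
      // Barrier: ZooKeeper is not touched until every stop command has returned.
      CompletableFuture.allOf(stops).join();
      zap(group, events);
    } finally {
      pool.shutdown();
    }
    return new ArrayList<>(events);
  }

  public static void main(String[] args) {
    System.out.println(
        stopGroupThenZap("default", List.of("compactor_default_1", "compactor_default_2")));
  }
}
```

A stop command returning is not necessarily the same as the remote process having exited; if the stop only sends a signal, a stricter version would also poll until the process is gone before cleaning up.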
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
