[1/7] git commit: ACCUMULO-1217 Add documentation about start-all.sh and start-here.sh to recover from process failure.

elserj Mon, 24 Mar 2014 17:38:14 -0700

Repository: accumulo
Updated Branches:
  refs/heads/1.6.0-SNAPSHOT 62ce7524f -> 05254f388
  refs/heads/master a9f1767b9 -> 139a192ee



ACCUMULO-1217 Add documentation about start-all.sh and start-here.sh to recover 
from process failure.


Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/3e749fb2
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/3e749fb2
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/3e749fb2

Branch: refs/heads/1.6.0-SNAPSHOT
Commit: 3e749fb2cc05a4fdae9753d97ffa99bff5aeb065
Parents: 62ce752
Author: Josh Elser <els...@apache.org>
Authored: Mon Mar 24 17:26:08 2014 -0700
Committer: Josh Elser <els...@apache.org>
Committed: Mon Mar 24 17:26:08 2014 -0700

----------------------------------------------------------------------
 .../chapters/troubleshooting.tex                | 41 ++++++++++++++++++++
 1 file changed, 41 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo/blob/3e749fb2/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
----------------------------------------------------------------------
diff --git a/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex b/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
index 18d472f..3e7572d 100644
--- a/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
+++ b/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
@@ -518,6 +518,47 @@ Besides these columns, you may see:
 
 \end{enumerate}
 
+\section{Simple System Recovery}
+
+Q. One of my Accumulo processes died. How do I bring it back?
+
+The easiest way to bring all services online for an Accumulo instance is to run the ``start-all.sh`` script.
+
+\small
+\begin{verbatim}
+  $ bin/start-all.sh
+\end{verbatim}
+\normalsize
+
+This process will check the process listing, using ``jps`` on each host before attempting to restart a service on the given host.
+Typically, this check is sufficient except in the face of a hung/zombie process. For large clusters, it may be
+undesirable to ssh to every node in the cluster to ensure that all hosts are running the appropriate processes and ``start-here.sh`` may be of use.
+
+\small
+\begin{verbatim}
+  $ ssh host_with_dead_process
+  $ bin/start-here.sh
+\end{verbatim}
+\normalsize
+
+``start-here.sh`` should be invoked on the host which is missing a given process. Like start-all.sh, it will start all
+necessary processes that are not currently running, but only on the current host and not cluster-wide. Tools such as ``pssh`` or 
+``pdsh`` can be used to automate this process.
+
+``start-server.sh`` can also be used to start a process on a given host; however, it is not generally recommended for
+users to issue this directly as the ``start-all.sh`` and ``start-here.sh`` scripts provide the same functionality with
+more automation and are less prone to user error.
+
+A. Use ``start-all.sh`` or ``start-here.sh``.
+
+Q. My process died again. Should I restart it via ``cron`` or tools like ``supervisord``?
+
+A. A repeatedly dying Accumulo process is a sign of a larger problem. Typically these problems are due to a
+misconfiguration of Accumulo or over-saturation of resources. Blind automation of any service restart inside of Accumulo
+is generally an undesirable situation as it is indicative of a problem that is being masked and ignored. Accumulo
+processes should be stable on the order of months and not require frequent restart.
+
+
 \section{Advanced System Recovery}
 
 Q. I had disasterous HDFS failure.  After bringing everything back up, several tablets refuse to go online.
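
For reference, the per-host restart described in the added section could be scripted roughly as follows. This is a minimal sketch and not part of the commit: the restart_missing function, the hosts-file format (one hostname per line, with # comments allowed), the RUNNER override, and the /opt/accumulo fallback for ACCUMULO_HOME are all assumptions made for illustration. It uses plain ssh in a loop as a stand-in for tools such as pssh or pdsh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run start-here.sh on every host listed in a file.
# Assumptions (not from the commit): the file holds one hostname per line,
# and Accumulo is installed under $ACCUMULO_HOME on each host.

restart_missing() {
  local hosts_file="$1"
  # Assumed fallback install location; adjust to match the actual cluster.
  local accumulo_home="${ACCUMULO_HOME:-/opt/accumulo}"
  # RUNNER defaults to ssh; set RUNNER=echo for a dry run.
  local runner="${RUNNER:-ssh}"
  local host

  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines and comment lines.
    case "$host" in
      ''|\#*) continue ;;
    esac
    # start-here.sh only starts processes that are not already running on
    # that host. stdin is redirected so ssh cannot consume the host list.
    "$runner" "$host" "$accumulo_home/bin/start-here.sh" < /dev/null
  done < "$hosts_file"
}

# Example (hypothetical file name):
#   RUNNER=echo restart_missing dead_hosts.txt
```

Setting RUNNER=echo prints the command that would be run for each host without connecting to it, which is a cheap way to check the host list before restarting anything.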
