Well, it seems to work. I wonder what the best way to test this would be? How can I remove a node from a cluster but still have it be up and running?

Jim

On Wed, Jul 24, 2013 at 12:10 PM, Jim Musil <jimtro...@gmail.com> wrote:

> Wow! Awesome. Give me a bit to try to plug this into my environment.
>
> The other way I was going to attempt this was to use the health check file
> option for the ping request handler. I would have to write a separate
> process in python or something that would ping zookeeper for active nodes
> and if the current box's ip is there, I would create the health check file
> which would make the ping work.
>
> I'd prefer not to introduce yet another process that I need to keep
> running, so this looks promising.
>
> Jim
>
> On Wed, Jul 24, 2013 at 11:49 AM, Timothy Potter [via Lucene] <
> ml-node+s472066n4080116...@n3.nabble.com> wrote:
>
>> Hi Jim,
>>
>> Based on our discussion, I cooked up this solution for my book Solr in
>> Action and would appreciate you looking it over to see if it meets
>> your needs. The basic idea is to extend Solr's built-in
>> PingRequestHandler to verify a replica is connected to Zookeeper and
>> is in the "active" state.
To enable this, install the custom JAR and
>> then update your solrconfig.xml to use this class instead of the
>> built-in one for the /admin/ping request handler:
>>
>> <requestHandler name="/admin/ping"
>>     class="sia.ch13.ClusterStateAwarePingRequestHandler">
>>
>> >>>> Code <<<<
>>
>> package sia.ch13;
>>
>> import org.apache.solr.cloud.CloudDescriptor;
>> import org.apache.solr.cloud.ZkController;
>> import org.apache.solr.common.SolrException;
>> import org.apache.solr.common.cloud.ClusterState;
>> import org.apache.solr.common.cloud.Slice;
>> import org.apache.solr.core.CoreContainer;
>> import org.apache.solr.core.CoreDescriptor;
>> import org.apache.solr.core.SolrCore;
>> import org.apache.solr.handler.PingRequestHandler;
>> import org.apache.solr.request.SolrQueryRequest;
>> import org.apache.solr.response.SolrQueryResponse;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> /**
>>  * Extends Solr's PingRequestHandler to check a replica's cluster
>>  * status as part of the health check.
>>  */
>> public class ClusterStateAwarePingRequestHandler extends PingRequestHandler {
>>
>>   public static Logger log =
>>       LoggerFactory.getLogger(ClusterStateAwarePingRequestHandler.class);
>>
>>   @Override
>>   public void handleRequestBody(SolrQueryRequest solrQueryRequest,
>>       SolrQueryResponse solrQueryResponse) throws Exception {
>>     // delegate to the base class to check the status of this local index
>>     super.handleRequestBody(solrQueryRequest, solrQueryResponse);
>>
>>     // if ping status is OK, then check cluster state of this core
>>     if ("OK".equals(solrQueryResponse.getValues().get("status"))) {
>>       verifyThisReplicaIsActive(solrQueryRequest.getCore());
>>     }
>>   }
>>
>>   /**
>>    * Verifies this replica is "active".
>>    */
>>   protected void verifyThisReplicaIsActive(SolrCore solrCore) throws SolrException {
>>     String replicaState = "unknown";
>>     String nodeName = "?";
>>     String shardName = "?";
>>     String collectionName = "?";
>>     String role = "?";
>>     Exception exc = null;
>>     try {
>>       CoreDescriptor coreDescriptor = solrCore.getCoreDescriptor();
>>       CoreContainer coreContainer = coreDescriptor.getCoreContainer();
>>       CloudDescriptor cloud = coreDescriptor.getCloudDescriptor();
>>
>>       shardName = cloud.getShardId();
>>       collectionName = cloud.getCollectionName();
>>       role = (cloud.isLeader() ? "Leader" : "Replica");
>>
>>       ZkController zkController = coreContainer.getZkController();
>>       if (zkController != null) {
>>         nodeName = zkController.getNodeName();
>>         if (zkController.isConnected()) {
>>           ClusterState clusterState = zkController.getClusterState();
>>           Slice slice = clusterState.getSlice(collectionName, shardName);
>>           replicaState = (slice != null) ? slice.getState() : "gone";
>>         } else {
>>           replicaState = "not connected to Zookeeper";
>>         }
>>       } else {
>>         replicaState = "Zookeeper not enabled/configured";
>>       }
>>     } catch (Exception e) {
>>       replicaState = "error determining cluster state";
>>       exc = e;
>>     }
>>
>>     if ("active".equals(replicaState)) {
>>       log.info(String.format("%s at %s for %s in the %s collection is active.",
>>           role, nodeName, shardName, collectionName));
>>     } else {
>>       // fail the ping by raising an exception
>>       String errMsg = String.format(
>>           "%s at %s for %s in the %s collection is not active! State is: %s",
>>           role, nodeName, shardName, collectionName, replicaState);
>>       if (exc != null) {
>>         throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, errMsg, exc);
>>       } else {
>>         throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, errMsg);
>>       }
>>     }
>>   }
>> }
>>
>> On Tue, Jul 23, 2013 at 1:46 PM, jimtronic <[hidden email]> wrote:
>>
>> > I think the best bet here would be a ping like handler that would simply
>> > return the state of only this box in the cluster:
>> >
>> > Something like /admin/state which would return
>> > "down","active","leader","recovering"
>> >
>> > I'm not really sure where to begin however. Any ideas?
>> >
>> > jim
>> >
>> > On Mon, Jul 22, 2013 at 12:52 PM, Timothy Potter [via Lucene] <[hidden email]> wrote:
>> >
>> >> There is but I couldn't get it to work in my environment on Jetty, see:
>> >>
>> >> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCAJt9Wnib+p_woYODtrSPhF==v8Vx==mDBd_qH=x_knbw-BnPXQ@...%3E
>> >>
>> >> Let me know if you have any better luck. I had to resort to something
>> >> hacky but was out of time I could devote to such unproductive
>> >> endeavors ;-)
>> >>
>> >> On Mon, Jul 22, 2013 at 10:49 AM, jimtronic <[hidden email]> wrote:
>> >>
>> >> > I'm not sure why it went down exactly -- I restarted the process and lost the
>> >> > logs. (d'oh!)
>> >> >
>> >> > An OOM seems likely, however. Is there a setting for killing the processes
>> >> > when solr encounters an OOM?
>> >> >
>> >> > Thanks!
>> >> >
>> >> > Jim
>> >> >
>> >> > --
>> >> > View this message in context: http://lucene.472066.n3.nabble.com/Node-down-but-not-out-tp4079495p4079507.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Node-down-but-not-out-tp4079495p4080169.html
Sent from the Solr - User mailing list archive at Nabble.com.
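
For comparison, the health check file approach mentioned earlier in the thread could look roughly like the sketch below. It is not code from the thread or from Solr: the class name HealthFileSync, the sync method, the file name and the node name are all hypothetical, and the ZooKeeper lookup is left out (the caller is assumed to supply the set of live node names). It only shows the file toggling that a ping handler configured with a health check file would react to.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Hypothetical helper, not part of Solr: keeps a health check file in step
// with whether this node appears in a caller-supplied set of live nodes.
public class HealthFileSync {

    /**
     * Creates healthFile when thisNode is in liveNodes and deletes it otherwise.
     * Returns true if the file is present after the call.
     */
    public static boolean sync(Path healthFile, Set<String> liveNodes, String thisNode)
            throws IOException {
        if (liveNodes.contains(thisNode)) {
            try {
                Files.createFile(healthFile);
            } catch (FileAlreadyExistsException alreadyThere) {
                // The file is already in place, so there is nothing to do.
            }
            return true;
        }
        // Node is not listed as live: remove the file so the ping starts failing.
        Files.deleteIfExists(healthFile);
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Assumed values for illustration only.
        Path healthFile = Files.createTempDirectory("solr-health").resolve("server-enabled.txt");
        String thisNode = "10.0.0.5:8983_solr";

        boolean up = sync(healthFile, Set.of(thisNode, "10.0.0.6:8983_solr"), thisNode);
        System.out.println("listed as live, file present: " + up);

        boolean stillUp = sync(healthFile, Set.of("10.0.0.6:8983_solr"), thisNode);
        System.out.println("not listed, file present: " + stillUp);
    }
}
```

Something still has to run this on a schedule and fetch the live nodes, which is the extra process the custom ping handler avoids.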