Filed: https://issues.apache.org/jira/browse/SOLR-5945
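
For reference, a minimal sketch of the kind of brief retry discussed below. This is not Solr's code: `RetryingConnect`, `Connector`, and `withRetry` are hypothetical names, and the connector stands in for whatever step constructs the ZooKeeper client during reconnect. It only assumes that step can throw java.net.UnknownHostException, as in the stack trace.

```java
import java.net.UnknownHostException;

// Hypothetical helper, not part of Solr or ZooKeeper.
final class RetryingConnect {

    /** Stand-in for the step that builds the client and may fail on DNS lookup. */
    @FunctionalInterface
    public interface Connector<T> {
        T connect() throws UnknownHostException;
    }

    /**
     * Calls connector.connect() up to maxAttempts times, sleeping delayMillis
     * between failed attempts. Rethrows the last UnknownHostException if
     * every attempt fails.
     */
    public static <T> T withRetry(Connector<T> connector, int maxAttempts, long delayMillis)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connector.connect();
            } catch (UnknownHostException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }
}
```

The attempt count is bounded so that a genuinely wrong host name still surfaces as an error instead of looping forever. The JVM also caches failed lookups for a short time (networkaddress.cache.negative.ttl, 10 seconds by default), so a very short delay may only replay the cached failure.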

On Tue, Apr 1, 2014 at 11:10 AM, Jessica Mallet <mewmewb...@gmail.com> wrote:

> Will do Mark. Thanks!
>
> On Sun, Mar 30, 2014 at 1:29 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> We don't currently retry, but I don't think it would hurt much if we did,
>> at least briefly.
>>
>> If you want to file a JIRA issue, that would be the best way to get it in
>> a future release.
>>
>> --
>> Mark Miller
>> about.me/markrmiller
>>
>> On March 28, 2014 at 5:40:47 PM, Michael Della Bitta
>> (michael.della.bi...@appinions.com) wrote:
>>
>> Hi, Jessica,
>>
>> We've had a similar problem when DNS resolution of our Hadoop task nodes
>> has failed. They tend to take a dirt nap until you fix the problem
>> manually. Are you experiencing this in AWS as well?
>>
>> I'd say the two things to do are to poll the node state via HTTP using a
>> monitoring tool so you get an immediate notification of the problem, and
>> to install some sort of caching server like nscd if you expect to have
>> DNS resolution failures regularly.
>>
>> Michael Della Bitta
>> Applications Developer
>> o: +1 646 532 3062
>> appinions inc.
>> "The Science of Influence Marketing"
>> 18 East 41st Street
>> New York, NY 10017
>> t: @appinions <https://twitter.com/Appinions>
>> g+: plus.google.com/appinions
>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>> w: appinions.com <http://www.appinions.com/>
>>
>> On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet <mewmewb...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > First off, I'd like to give a disclaimer that this is probably a very
>> > edge-case issue. However, since it happened to us, I would like to get
>> > some advice on how best to handle this failure scenario.
>> >
>> > Basically, we had a network issue where we temporarily lost connection
>> > and DNS. The ZooKeeper client properly triggered the watcher. However,
>> > when trying to reconnect, the following exception was thrown:
>> >
>> > 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
>> >   at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>> >   at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
>> >   at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
>> >   at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
>> >   at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>> >   at java.net.InetAddress.getAllByName(InetAddress.java:1063)
>> >   at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
>> >   at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>> >   at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
>> >   at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
>> >   at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
>> >   at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
>> >   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>> >   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> >
>> > I tried to look at the code, and it seems that there'd be no further
>> > retries to connect to ZooKeeper, so the node is basically left in a bad
>> > state and will not recover on its own. (Please correct me if I'm reading
>> > this wrong.) Thinking about it, this is probably fair, since normally
>> > you wouldn't expect retries to fix an "unknown host" issue (even though
>> > in our case it would have), but I'm wondering what we should do to
>> > handle this situation if it happens again in the future.
>> >
>> > Any advice is appreciated.
>> >
>> > Thanks,
>> > Jessica