I was right for once <G>. Thanks for updating the Wiki!
Erick

On Tue, Nov 6, 2012 at 9:42 AM, Nick Chase <nch...@earthlink.net> wrote:

> Thanks a million, Erick! You're right about killing both nodes hosting
> the shard. I'll get the wiki corrected.
>
> ---- Nick
>
> On 11/3/2012 10:51 PM, Erick Erickson wrote:
>
>> SolrCloud doesn't work unless every shard has at least one server that
>> is up and running.
>>
>> I _think_ you might be killing both nodes that host one of the shards.
>> The admin page has a link showing you the state of your cluster. So
>> when this happens, does that page show both nodes for that shard being
>> down?
>>
>> And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one
>> ZK node, killing that will bring down the whole cluster. Which is why
>> the usual recommendation is that ZK be run externally and usually an
>> odd number of ZK nodes (three or more).
>>
>> Anyone can create a login and edit the Wiki, so any clarifications are
>> welcome!
>>
>> Best
>> Erick
>>
>> On Sat, Nov 3, 2012 at 12:17 PM, Nick Chase <nch...@earthlink.net> wrote:
>>
>>> I think there's a change in the behavior of SolrCloud vs. what's in
>>> the wiki, but I was hoping someone could confirm for me. I checked
>>> JIRA and there were a couple of issues requesting partial results if
>>> one server comes down, but that doesn't seem to be the issue here. I
>>> also checked CHANGES.txt and don't see anything that seems to apply.
>>>
>>> I'm running "Example B: Simple two shard cluster with shard replicas"
>>> from the wiki at https://wiki.apache.org/solr/SolrCloud and
>>> everything starts out as expected. However, when I get to the part
>>> about failover behavior, things get a little wonky.
>>>
>>> I added data to the shard running on 7475. If I kill 7500, a query to
>>> any of the other servers works fine.
>>> But if I kill 7475, rather than getting zero results on a search to
>>> 8983 or 8900, I get a 503 error:
>>>
>>> <response>
>>>   <lst name="responseHeader">
>>>     <int name="status">503</int>
>>>     <int name="QTime">5</int>
>>>     <lst name="params">
>>>       <str name="q">*:*</str>
>>>     </lst>
>>>   </lst>
>>>   <lst name="error">
>>>     <str name="msg">no servers hosting shard:</str>
>>>     <int name="code">503</int>
>>>   </lst>
>>> </response>
>>>
>>> I don't see any errors in the consoles.
>>>
>>> Also, if I kill 8983, which includes the Zookeeper server, everything
>>> dies, rather than just staying in a steady state; the other servers
>>> continually show:
>>>
>>> Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
>>> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
>>> Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
>>> WARNING: Session 0x13ac6cf87890002 for server null, unexpected error,
>>> closing socket connection and attempting reconnect
>>> java.net.ConnectException: Connection refused: no further information
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)
>>> Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
>>>
>>> over and over again, and a call to any of the servers shows a
>>> connection error to 8983.
>>>
>>> This is the current 4.0.0 release, running on Windows 7.
>>>
>>> If this is the proper behavior and the wiki needs updating, fine; I
>>> just need to know. Otherwise if anybody has any clues as to what I
>>> may be missing, I'd be grateful. :)
>>>
>>> Thanks...
>>>
>>> --- Nick
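
For anyone finding this thread later, the two availability rules discussed above (every shard needs at least one live node, and ZooKeeper needs a majority of its ensemble up) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how SolrCloud or ZooKeeper implement it. The shard names and the port-to-shard layout are assumptions inferred from the behavior described in the thread (7475 and 7500 on one shard, 8983 and 8900 on the other), and the function names are made up for the example.

```python
# Illustrative sketch only: models the two availability rules from the thread.
# The layout below is an ASSUMPTION inferred from the discussion; the shard
# names "shard1"/"shard2" are hypothetical labels, not taken from the thread.
LAYOUT = {
    "shard1": [8983, 8900],
    "shard2": [7475, 7500],
}


def shards_without_live_node(layout, down_ports):
    """Return the shards whose hosting nodes are all down.

    A non-empty result corresponds to the situation where a query
    fails with "no servers hosting shard".
    """
    down = set(down_ports)
    return sorted(
        shard for shard, ports in layout.items() if not (set(ports) - down)
    )


def zk_quorum_size(ensemble_size):
    """Smallest strict majority of a ZooKeeper ensemble of the given size."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def zk_has_quorum(ensemble_size, nodes_up):
    """True if enough ZooKeeper nodes are up to form a majority."""
    return nodes_up >= zk_quorum_size(ensemble_size)
```

With this layout, `shards_without_live_node(LAYOUT, {7500})` is empty, so queries still work, while killing both 7475 and 7500 leaves `shard2` with no live node. Likewise `zk_has_quorum(1, 0)` is false, which matches the whole cluster going down when the single embedded ZooKeeper on 8983 is killed, whereas a three-node ensemble can lose one node and still have a quorum.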