On Mon, 5 Mar 2012 11:26:20 -0500, Mark Miller <markrmil...@gmail.com>
wrote:
On Mar 5, 2012, at 10:01 AM, dar...@ontrenet.com wrote:
If one of those 10 indexing nodes goes down or falls out of sync and
comes back, does ZK block the state of indexing until that single node
catches back up?
No - if a node falls out of sync or comes back, the rest of the
cluster continues as normal and the node goes into recovery.
In recovery, the node tries two things to catch up: first it tries to
peer sync - if it's off by fewer than 100 updates, it will simply
exchange updates with the leader and come back into sync. If it's off
by more than that, it will start buffering updates from the leader,
replicate the full index from the leader, and then apply its buffered
updates to come back in sync.
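To make the two recovery paths concrete, here is a minimal sketch of the decision described above. It is a toy model, not Solr code: the index is a plain list of update ids, the replica is assumed to hold a prefix of the leader's updates, and the names `recover`, `PEER_SYNC_LIMIT` and `updates_during_copy` are hypothetical.

```python
PEER_SYNC_LIMIT = 100  # threshold taken from the description above


def recover(replica, leader, updates_during_copy=()):
    """Bring `replica` back in sync with `leader` and report the path taken.

    Both arguments are lists of update ids standing in for an index.
    `updates_during_copy` models updates that arrive while a full copy runs.
    """
    missed = len(leader) - len(replica)
    if missed < PEER_SYNC_LIMIT:
        # Peer sync: fetch only the missing tail from the leader.
        replica.extend(leader[len(replica):])
        return "peer_sync"

    # Too far behind: buffer incoming updates, copy the whole index,
    # then replay the buffer on top of the copy.
    buffered = list(updates_during_copy)
    replica[:] = leader
    replica.extend(buffered)
    return "full_replication"
```

A replica that is 50 updates behind takes the peer sync path. One that is 400 behind takes the full replication path and ends up with the leader's index plus whatever was buffered during the copy.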
The only time indexing is stopped for a node is if that node loses
its connection to zookeeper. All other nodes that can still talk to
zookeeper will continue indexing. How soon we consider that we can't
talk to zookeeper depends on the zk session timeout - I have to look,
but for an embedded ensemble, we may be defaulting this a little low
currently.
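For reference, the session timeout a client ends up with is negotiated: the ZooKeeper server clamps the requested value between its minimum and maximum session timeouts, which default to 2 and 20 times `tickTime`. A small sketch of that arithmetic, assuming a `tickTime` of 2000 ms (the function name and parameters are made up for illustration):

```python
def negotiated_timeout(requested_ms, tick_ms=2000, min_ms=None, max_ms=None):
    """Clamp a requested session timeout the way a ZooKeeper server does.

    The defaults follow ZooKeeper's documented behaviour: the minimum is
    2 * tickTime and the maximum is 20 * tickTime unless overridden.
    """
    low = 2 * tick_ms if min_ms is None else min_ms
    high = 20 * tick_ms if max_ms is None else max_ms
    return max(low, min(requested_ms, high))
```

Under these assumptions a request for 10000 ms is granted unchanged, a request for 1000 ms is raised to 4000 ms, and a request for 60000 ms is capped at 40000 ms.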
That would suggest that in our case at some point Solr drops the
connection to ZK and is unable to restore the connection, even after
restarting Tomcat many times.
I know ZK is running fine and responds with imok when I ask ruok. When
I restart Tomcat I'll see these bad things in ZK's log:
2012-03-05 17:55:07,084 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -
Accepted socket connection from /141.105.120.152:52328
2012-03-05 17:55:07,090 [myid:] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@792] -
Connection request from old client /141.105.120.152:52328; will be
dropped if server is in r-o mode
2012-03-05 17:55:07,091 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client
attempting to establish new session at /141.105.120.152:52328
2012-03-05 17:55:07,094 [myid:] - INFO [SyncThread:0:FileTxnLog@199] -
Creating new log file: log.1
2012-03-05 17:55:07,107 [myid:] - INFO
[SyncThread:0:ZooKeeperServer@604] - Established session
0x135e3ffdb540000 with negotiated timeout 10000 for client
/141.105.120.152:52328
2012-03-05 17:55:07,206 [myid:] - INFO [ProcessThread(sid:0
cport:-1)::PrepRequestProcessor@617] - Got user-level KeeperException
when processing sessionid:0x135e3ffdb540000 type:delete cxid:0xb
zxid:0x5 txntype:-1 reqpath:n/a Error
Path:/live_nodes/cn003.openindex.io:80_solr Error:KeeperErrorCode =
NoNode for /live_nodes/cn003.openindex.io:80_solr
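The ruok check mentioned above can be scripted. This is a rough sketch using only the standard library. The helper names are hypothetical, and newer ZooKeeper releases may require the command to be whitelisted on the server. Note that an imok reply only shows the server process is answering; it says nothing about the state of a particular client session.

```python
import socket


def zk_four_letter(host, port=2181, cmd="ruok", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection after replying, so read to EOF.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def is_healthy(host, port=2181):
    """Return True if the server answers ruok with imok."""
    try:
        return zk_four_letter(host, port) == "imok"
    except OSError:
        return False
```

Calling `is_healthy("localhost")` against a running server should return True. A refused connection or a timeout is reported as False rather than raising.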
Solr will not come back up, even with a clean ZK data dir. I'll clear
the dataDir of one of the stubborn Solr nodes and retry. ... The Solr
node comes back up, finally. Here's the ZK log:
2012-03-05 17:56:55,939 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] -
Accepted socket connection from /141.105.120.152:36311
2012-03-05 17:56:55,944 [myid:] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@792] -
Connection request from old client /141.105.120.152:36311; will be
dropped if server is in r-o mode
2012-03-05 17:56:55,944 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client
attempting to establish new session at /141.105.120.152:36311
2012-03-05 17:56:55,967 [myid:] - INFO
[SyncThread:0:ZooKeeperServer@604] - Established session
0x135e3ffdb540001 with negotiated timeout 10000 for client
/141.105.120.152:36311
2012-03-05 17:56:56,058 [myid:] - INFO [ProcessThread(sid:0
cport:-1)::PrepRequestProcessor@617] - Got user-level KeeperException
when processing sessionid:0x135e3ffdb540001 type:delete cxid:0x3
zxid:0x6b txntype:-1 reqpath:n/a Error
Path:/live_nodes/cn003.openindex.io:80_solr Error:KeeperErrorCode =
NoNode for /live_nodes/cn003.openindex.io:80_solr
I'm not sure about the problem but it looks like Solr won't start fine
if there's an issue after listing all segment files. It may not be a ZK
or cloud problem at all. Any suggestions?
Thanks
- Mark Miller
lucidimagination.com