replica recovery

2015-10-27 Thread Brian Scholl
Hello,

I am experiencing a failure mode where a replica is unable to recover, and it 
will try to do so forever.  In writing this email I want to make sure that I 
haven't missed anything obvious or overlooked a configurable option that could 
help.  If something about this looks funny, I would really like to hear from 
you.

Relevant details:
- solr 5.3.1
- java 1.8
- ubuntu linux 14.04 lts
- the cluster is composed of 1 SolrCloud collection with 100 shards backed by a 
3 node zookeeper ensemble
- there are 200 solr servers in the cluster, 1 replica per shard
- a shard replica is larger than 50% of the available disk
- ~40M docs added per day, total indexing time is 8-10 hours spread over the day
- autoCommit is set to 15s
- softCommit is not defined
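
For reference, a solrconfig.xml sketch matching those commit settings; the 
openSearcher value is an assumption, since only the 15s interval is stated 
above:

    <autoCommit>
      <maxTime>15000</maxTime>           <!-- 15 seconds, as described -->
      <openSearcher>true</openSearcher>  <!-- assumed: with no softCommit, hard commits make docs visible -->
    </autoCommit>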

I think I have traced the failure to the following set of events but would 
appreciate feedback:

1. new documents are being indexed
2. the leader of a shard, server A, fails for any reason (java crashes, times 
out with zookeeper, etc)
3. zookeeper promotes the other replica of the shard, server B, to the leader 
position and indexing resumes
4. server A comes back online (typically 10s of seconds later) and reports to 
zookeeper
5. zookeeper tells server A that it is no longer the leader and to sync with 
server B
6. server A checks with server B but finds that server B's index version is 
different from its own
7. server A begins replicating a new copy of the index from server B using the 
(legacy?) replication handler
8. the original index on server A was not deleted, so it runs out of disk space 
mid-replication (each replica is larger than 50% of the disk, so the old and 
new copies cannot both fit)
9. server A throws an error, deletes the partially replicated index, and then 
tries to replicate again

At this point I think steps 6 => 9 will loop forever.

If the actual errors from solr.log would be useful, let me know; I'm not 
including them now for brevity since this email is already pretty long.  In a 
nutshell, and in order, on server A I can find: the error that took it down, 
the post-recovery instruction from ZK to unregister itself as leader, the 
corrupt index error message, and then the (start - whoops, out of disk - stop) 
loop of replication messages.

First I want to ask: is what I described possible, or did I get lost somewhere 
along the way reading the docs?  Is there any reason to think that solr should 
not do this?

If my version of events is feasible I have a few other questions:

1. What happens to the docs that were indexed on server A but never replicated 
to server B before the failure?  Assuming the replica on server A were to 
complete the recovery process, would those docs appear in the index or are they 
gone for good?

2. I am guessing that the corrupt replica on server A is not deleted because it 
is still viable: if server B had a catastrophic failure, you could pick up the 
pieces from server A.  If so, is this a configurable option somewhere?  I'd 
rather take my chances on server B going down before replication finishes than 
be stuck in this state and have to manually intervene.  Besides, I have 
disaster recovery backups for exactly this situation.

3. Is there anything I can do to prevent this type of failure?  It seems to me 
that if server B gets even 1 new document as leader, the shard will enter this 
state.  My only thought right now is to try to stop sending documents for 
indexing the instant a leader goes down, but on the surface this solution 
sounds tough to implement perfectly (and it would have to be perfect).

If you got this far thanks for sticking with me.

Cheers,
Brian



Re: replica recovery

2015-10-27 Thread Brian Scholl
Whoops, in the description of my setup that should say 2 replicas per shard.  
Every server has a replica.


> On Oct 27, 2015, at 20:16, Brian Scholl  wrote:
> 
> [...]



Re: replica recovery

2015-10-27 Thread Brian Scholl
Both are excellent points and I will look to implement them.  In particular, I 
wonder whether a respectable increase to the numRecordsToKeep param could solve 
this problem entirely.

Thanks!



> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
> 
> 
> On the face of it, your scenario seems plausible. I can offer two pieces
> of info that may or may not help you:
> 
> 1. A write request to Solr will not be acknowledged until an attempt has
> been made to write to all relevant replicas. So, B won’t ever be missing
> updates that were applied to A, unless communication with B was disrupted
> somehow at the time of the update request. You can add a min_rf param to
> your write request, in which case the response will tell you how many
> replicas received the update, but it’s still up to your indexer client to
> decide what to do if that’s less than your replication factor.
> 
> See 
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> for more info.
> 
> 2. There are two forms of replication. The usual thing is for the leader
> for each shard to write an update to all replicas before acknowledging the
> write itself, as above. If a replica is less than N docs behind the
> leader, the leader can replay those docs to the replica from its
> transaction log. If a replica is more than N docs behind though, it falls
> back to the replication handler recovery mode you mention, and attempts to
> re-sync the whole shard from the leader.
> The default N for this is 100, which is pretty low for a high-update-rate
> index. It can be changed by increasing the size of the transaction log
> (via numRecordsToKeep), but be aware that a large transaction log size can
> delay node restart.
> 
> See 
> https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-TransactionLog
> for more info.
> 
> 
> Hope some of that helps; I don't know a way to say
> delete-first-on-recovery.
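
A minimal SolrJ sketch of the min_rf flow Jeff describes; the zkHost string, 
collection name, and target factor of 2 are assumptions, and the error 
handling is only illustrative:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MinRfIndexer {
        public static void main(String[] args) throws Exception {
            // zkHost and collection name are placeholders
            CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            client.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            req.setParam("min_rf", "2"); // ask Solr to report the achieved replication factor

            UpdateResponse rsp = req.process(client);

            // CloudSolrClient helper that digs the achieved factor out of the response
            int rf = client.getMinAchievedReplicationFactor("mycollection", rsp.getResponse());
            if (rf < 2) {
                // fewer replicas than desired saw the write; Solr will not retry for you,
                // so the indexer client must decide whether to retry, queue, or alert
                System.err.println("achieved rf=" + rf + ", update may not be fully replicated");
            }
            client.close();
        }
    }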
> 
> On 10/27/15, 5:21 PM, "Brian Scholl"  wrote:
> 
>> [...]

Re: replica recovery

2015-11-19 Thread Brian Scholl
I have opted to modify the number and size of the transaction logs that I keep 
in order to resolve the original issue I described.  In so doing I think I have 
created a new problem; feedback is appreciated.

Here are the new updateLog settings:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
  <int name="numRecordsToKeep">1000</int>
  <int name="maxNumLogsToKeep">5760</int>
</updateLog>

First I want to make sure I understand what these settings do:
numRecordsToKeep: the number of documents to keep per transaction log file
maxNumLogsToKeep: the total number of transaction log files to retain

During my testing I thought I observed that a new tlog is created every time 
auto-commit is triggered (every 15 seconds in my case), so I set 
maxNumLogsToKeep high enough to contain an entire day's worth of updates.  
Knowing that I could potentially need to bulk load some data, I set 
numRecordsToKeep higher than my max throughput per replica for 15 seconds.
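
For reference, the arithmetic behind that value, assuming exactly one tlog per 
15-second autoCommit:

    1 tlog / 15 s  =  4 tlogs/minute  =  240 tlogs/hour  =  5760 tlogs/day

so maxNumLogsToKeep = 5760 holds roughly one day of transaction logs.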

The problem that I think this has created is that I am now running out of file 
descriptors on the servers.  After indexing new documents for a couple of 
hours, some servers (not all) will start logging this error rapidly:

73021439 WARN  
(qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983})
 [   ] o.e.j.s.ServerConnector
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at 
org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
at 
org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)

The output of ulimit -n for the user running the solr process is 1024.  I am 
pretty sure I can prevent this error from occurring by increasing the limit on 
each server, but it isn't clear to me how high it should be or whether raising 
the limit will cause new problems.
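
A sketch of one common way to inspect and raise the limit on Ubuntu 14.04; the 
"solr" user name and the 65536 value are assumptions, and a Solr started by an 
init script may need the limit set in the init configuration instead:

    # count descriptors actually held by the Solr process (PID is a placeholder)
    lsof -p <solr-pid> | wc -l

    # /etc/security/limits.conf: raise the open-file limit for the user running Solr
    solr  soft  nofile  65536
    solr  hard  nofile  65536

    # log in again as that user, then verify
    ulimit -n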

Any advice you could provide in this situation would be awesome!

Cheers,
Brian



> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
> 
> [...]

Re: replica recovery

2015-11-19 Thread Brian Scholl
Hey Erick,

Thanks for the reply.  

I plan on rebuilding my cluster soon with more nodes so that the index size 
(including tlogs) is under 50% of the available disk at a minimum; ideally we 
will shoot for under 33%, budget permitting.  I think I now understand the 
problem that managing this resource will solve, and I appreciate your (and 
Shawn's) feedback.

I would still like to increase the number of transaction logs retained so that 
shard recovery (outside of long-term failures) is faster than replicating the 
entire shard from the leader.  I understand that this is an optimization and 
not a solution for replication.  If I'm being thick about this, call me out :)

Cheers,
Brian




> On Nov 19, 2015, at 11:30, Erick Erickson  wrote:
> 
> First, every time you autocommit there _should_ be a new
> tlog created. A hard commit truncates the tlog by design.
> 
> My guess (not based on knowing the code) is that
> Real Time Get needs a file handle open to each tlog file,
> and you'll have a bunch of them. Lots and lots and lots. Thus
> the too-many-file-handles error is just waiting out there for you.
> 
> However, this entire approach is, IMO, not going to solve
> anything for you. Or rather other problems will come out
> of the woodwork.
> 
> To wit: At some point, you _will_ need to have at least as
> much free space on your disk as your current index occupies,
> even without recovery. Background merging of segments can
> effectively do the same thing as an optimize step, which rewrites
> the entire index to new segments before deleting the old
> segments. So far you haven't hit that situation in steady-state,
> but you will.
> 
> Simply put, I think you're wasting your time pursuing the tlog
> option. You must have bigger disks or smaller indexes such
> that there is at least as much free disk space at all times as
> your index occupies. In fact if the tlogs are on the same
> drive as your index, the tlog option you're pursuing is making
> the situation _worse_ by making running out of disk space
> during a merge even more likely.
> 
> So unless there's a compelling reason you can't use bigger
> disks, IMO you'll waste lots and lots of valuable
> engineering time before... buying bigger disks.
> 
> Best,
> Erick
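
A concrete version of that disk math, with an assumed 600 GB disk and the 
larger-than-50% replica size from the original post:

    index size:        350 GB on a 600 GB disk
    full recovery:     350 GB (old index) + 350 GB (copy from leader)   = 700 GB > 600 GB
    worst-case merge:  350 GB (old segments) + up to 350 GB (rewritten) = 700 GB > 600 GB

Both operations keep the old and new data on disk at the same time, so either 
one can fill the disk.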
> 
> On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl  wrote:
>> [...]

Re: replica recovery

2015-11-19 Thread Brian Scholl
Primarily our outages are caused by Java crashes or really long GC pauses; in 
short, not all of our developers have a good sense of what types of queries are 
unsafe if abused (for example, cursorMark or start=).
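
Generic sketches of the two deep-paging styles mentioned; the host, collection, 
and field names are made up:

    # start= makes each shard collect and sort start+rows documents:
    curl 'http://localhost:8983/solr/mycollection/select?q=*:*&start=1000000&rows=100'

    # cursorMark keeps the working set bounded; the sort must include the uniqueKey:
    curl 'http://localhost:8983/solr/mycollection/select?q=*:*&sort=id+asc&rows=100&cursorMark=*'
    # each response returns nextCursorMark, which is passed as cursorMark on the next request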

Honestly, stability of the JVM is another task I have coming up.  I agree that 
recovery should be uncommon; we're just not where we need to be yet.

Cheers,
Brian




> On Nov 19, 2015, at 15:14, Erick Erickson  wrote:
> 
> bq: I would still like to increase the number of transaction logs
> retained so that shard recovery (outside of long term failures) is
> faster than replicating the entire shard from the leader
> 
> That's legitimate, but (you knew that was coming!) nodes having to
> recover _should_ be a rare event. Is this happening often or is it a
> result of testing? If nodes are going into recovery for no good reason
> (i.e. network being unplugged, whatever) I'd put some energy into
> understanding that as well. Perhaps there are operational type things
> that should be addressed (e.g. stop indexing, wait for commit, _then_
> bounce Solr instances).
> 
> 
> Best,
> Erick
> 
> 
> 
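
A sketch of the stop-indexing-then-bounce sequence Erick suggests; the 
collection name and service command are assumptions:

    # 1. stop indexer clients, then issue a hard commit so the tlog is truncated
    curl 'http://localhost:8983/solr/mycollection/update?commit=true'

    # 2. restart the node only after the commit returns
    sudo service solr restart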
> On Thu, Nov 19, 2015 at 10:17 AM, Brian Scholl  wrote:
>> [...]