Right, I've managed to double the memory required by Solr by varying the _query_. Siiiggggh.

There are some JIRAs out there (don't have them readily available, sorry) that short-circuit queries that take "too long", and there are some others to short-circuit "expensive" queries. I believe the latter has been discussed but not committed.

Best,
Erick
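Related to the "too long" case: the stock timeAllowed parameter already caps the search phase of a request, though it is not a complete guard (not every phase of query processing is covered, and results may come back flagged as partial), which is roughly the gap the JIRAs above address. A minimal sketch, with the host and collection name as placeholders:

    # Cap the search phase at 5 seconds; the response may be flagged partialResults=true.
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=10&timeAllowed=5000"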
On Thu, Nov 19, 2015 at 12:21 PM, Brian Scholl <bsch...@legendary.com> wrote:
> Primarily our outages are caused by Java crashes or really long GC pauses; in short, not all of our developers have a good sense of what types of queries are unsafe if abused (for example, cursorMark or start=).
>
> Honestly, stability of the JVM is another task I have coming up. I agree that recovery should be uncommon, we're just not where we need to be yet.
>
> Cheers,
> Brian
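For reference, the deep-paging pattern mentioned above: cursorMark pages with a stable sort instead of large start= offsets, which keeps each request cheap but still lets a client walk the entire index if nothing reins it in. A hedged sketch, with the collection name and sort field assumed (the sort must end with the uniqueKey field):

    # First page: cursorMark=* ; subsequent pages resend the same query with the
    # nextCursorMark value returned by the previous response (placeholder below).
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=100&sort=id+asc&cursorMark=*"
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=100&sort=id+asc&cursorMark=<nextCursorMark>"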
>> On Nov 19, 2015, at 15:14, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> bq: I would still like to increase the number of transaction logs retained so that shard recovery (outside of long term failures) is faster than replicating the entire shard from the leader
>>
>> That's legitimate, but (you knew that was coming!) nodes having to recover _should_ be a rare event. Is this happening often, or is it a result of testing? If nodes are going into recovery for no good reason (i.e. network being unplugged, whatever) I'd put some energy into understanding that as well. Perhaps there are operational-type things that should be addressed (e.g. stop indexing, wait for commit, _then_ bounce Solr instances)...
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 19, 2015 at 10:17 AM, Brian Scholl <bsch...@legendary.com> wrote:
>>> Hey Erick,
>>>
>>> Thanks for the reply.
>>>
>>> I plan on rebuilding my cluster soon with more nodes so that the index size (including tlogs) is under 50% of the available disk at a minimum; ideally we will shoot for under 33%, budget permitting. I think I now understand the problem that managing this resource will solve, and I appreciate your (and Shawn's) feedback.
>>>
>>> I would still like to increase the number of transaction logs retained so that shard recovery (outside of long-term failures) is faster than replicating the entire shard from the leader. I understand that this is an optimization and not a solution for replication. If I'm being thick about this, call me out :)
>>>
>>> Cheers,
>>> Brian
>>>
>>>> On Nov 19, 2015, at 11:30, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> First, every time you autocommit there _should_ be a new tlog created. A hard commit truncates the tlog by design.
>>>>
>>>> My guess (not based on knowing the code) is that Real Time Get needs file handles open to the tlog files, and you'll have a bunch of them. Lots and lots and lots. Thus the "too many file handles" problem is just waiting out there for you.
>>>>
>>>> However, this entire approach is, IMO, not going to solve anything for you. Or rather, other problems will come out of the woodwork.
>>>>
>>>> To wit: at some point, you _will_ need to have at least as much free space on your disk as your current index occupies, even without recovery. Background merging of segments can effectively do the same thing as an optimize step, which rewrites the entire index to new segments before deleting the old segments. So far you haven't hit that situation in steady state, but you will.
>>>>
>>>> Simply put, I think you're wasting your time pursuing the tlog option. You must have bigger disks or smaller indexes, such that there is at least as much free disk space at all times as your index occupies. In fact, if the tlogs are on the same drive as your index, the tlog option you're pursuing is making the situation _worse_ by making running out of disk space during a merge even more likely.
>>>>
>>>> So unless there's a compelling reason you can't use bigger disks, IMO you'll waste lots and lots of valuable engineering time before... buying bigger disks.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl <bsch...@legendary.com> wrote:
>>>>> I have opted to modify the number and size of transaction logs that I keep to resolve the original issue I described. In so doing I think I have created a new problem; feedback is appreciated.
>>>>>
>>>>> Here are the new updateLog settings:
>>>>>
>>>>> <updateLog>
>>>>>   <str name="dir">${solr.ulog.dir:}</str>
>>>>>   <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>>>>>   <int name="numRecordsToKeep">10000000</int>
>>>>>   <int name="maxNumLogsToKeep">5760</int>
>>>>> </updateLog>
>>>>>
>>>>> First I want to make sure I understand what these settings do:
>>>>> numRecordsToKeep: per transaction log file, keep this number of documents
>>>>> maxNumLogsToKeep: retain this number of transaction log files in total
>>>>>
>>>>> During my testing I thought I observed that a new tlog is created every time auto-commit is triggered (every 15 seconds in my case), so I set maxNumLogsToKeep high enough to contain an entire day's worth of updates. Knowing that I could potentially need to bulk load some data, I set numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>>>>>
>>>>> The problem I think this has created is that I am now running out of file descriptors on the servers. After indexing new documents for a couple of hours, some servers (not all) will start logging this error rapidly:
>>>>>
>>>>> 73021439 WARN (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983}) [   ] o.e.j.s.ServerConnector
>>>>> java.io.IOException: Too many open files
>>>>>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>>>>>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>>>>>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>>>>>         at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>>>>>         at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
>>>>>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>>>>>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> The output of ulimit -n for the user running the solr process is 1024. I am pretty sure I can prevent this error from occurring by increasing the limit on each server, but it isn't clear to me how high it should be or whether raising the limit will cause new problems.
>>>>>
>>>>> Any advice you could provide in this situation would be awesome!
>>>>>
>>>>> Cheers,
>>>>> Brian
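On the file-descriptor question: 1024 is the stock per-user soft limit on Ubuntu and is generally too low for a Solr node that holds many tlogs plus index segments plus sockets open at once. A sketch of how to inspect and raise the limit; the solr user name, the PID-discovery pattern, and the 65536 value are assumptions to adjust for your own hosts:

    # What the running Solr JVM actually got, and how many descriptors it is using right now
    SOLR_PID=$(pgrep -f start.jar | head -n 1)    # adjust the pattern for how Solr is launched
    grep "Max open files" /proc/$SOLR_PID/limits
    ls /proc/$SOLR_PID/fd | wc -l

    # Raise the per-user limit, e.g. in /etc/security/limits.conf, then restart Solr so the
    # new limit is picked up (values are illustrative, not a recommendation for your load):
    #   solr  soft  nofile  65536
    #   solr  hard  nofile  65536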
>>>>>> On Oct 27, 2015, at 20:50, Jeff Wartes <jwar...@whitepages.com> wrote:
>>>>>>
>>>>>> On the face of it, your scenario seems plausible. I can offer two pieces of info that may or may not help you:
>>>>>>
>>>>>> 1. A write request to Solr will not be acknowledged until an attempt has been made to write to all relevant replicas. So, B won't ever be missing updates that were applied to A, unless communication with B was disrupted somehow at the time of the update request. You can add a min_rf param to your write request, in which case the response will tell you how many replicas received the update, but it's still up to your indexer client to decide what to do if that's less than your replication factor.
>>>>>>
>>>>>> See https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance for more info.
>>>>>>
>>>>>> 2. There are two forms of replication. The usual thing is for the leader of each shard to write an update to all replicas before acknowledging the write itself, as above. If a replica is less than N docs behind the leader, the leader can replay those docs to the replica from its transaction log. If a replica is more than N docs behind, though, it falls back to the replication handler recovery mode you mention and attempts to re-sync the whole shard from the leader.
>>>>>>
>>>>>> The default N for this is 100, which is pretty low for a high-update-rate index. It can be changed by increasing the size of the transaction log (via numRecordsToKeep), but be aware that a large transaction log size can delay node restart.
>>>>>>
>>>>>> See https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-TransactionLog for more info.
>>>>>>
>>>>>> Hope some of that helps; I don't know a way to say delete-first-on-recovery.
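A hedged sketch of the min_rf idea above; the host, collection name, and document fields are placeholders, and note that min_rf is informational only — Solr will not fail the write for you, so the indexer has to inspect the achieved replication factor reported in the response:

    # Ask Solr to report how many replicas received this update (target rf of 2 here)
    curl "http://localhost:8983/solr/mycollection/update?min_rf=2" \
         -H "Content-Type: application/json" \
         -d '[{"id":"doc-1","title":"example"}]'
    # Compare the achieved rf in the response against your replication factor and
    # retry or alert from the indexer if it comes back lower.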
>>>>>> On 10/27/15, 5:21 PM, "Brian Scholl" <bsch...@legendary.com> wrote:
>>>>>>
>>>>>>> Whoops, in the description of my setup that should say 2 replicas per shard. Every server has a replica.
>>>>>>>
>>>>>>>> On Oct 27, 2015, at 20:16, Brian Scholl <bsch...@legendary.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am experiencing a failure mode where a replica is unable to recover, and it will try to do so forever. In writing this email I want to make sure that I haven't missed anything obvious or a configurable option that could help. If something about this looks funny, I would really like to hear from you.
>>>>>>>>
>>>>>>>> Relevant details:
>>>>>>>> - solr 5.3.1
>>>>>>>> - java 1.8
>>>>>>>> - ubuntu linux 14.04 lts
>>>>>>>> - the cluster is composed of 1 SolrCloud collection with 100 shards backed by a 3-node zookeeper ensemble
>>>>>>>> - there are 200 solr servers in the cluster, 1 replica per shard
>>>>>>>> - a shard replica is larger than 50% of the available disk
>>>>>>>> - ~40M docs added per day, total indexing time is 8-10 hours spread over the day
>>>>>>>> - autoCommit is set to 15s
>>>>>>>> - softCommit is not defined
>>>>>>>>
>>>>>>>> I think I have traced the failure to the following set of events but would appreciate feedback:
>>>>>>>>
>>>>>>>> 1. new documents are being indexed
>>>>>>>> 2. the leader of a shard, server A, fails for any reason (java crashes, times out with zookeeper, etc.)
>>>>>>>> 3. zookeeper promotes the other replica of the shard, server B, to the leader position and indexing resumes
>>>>>>>> 4. server A comes back online (typically tens of seconds later) and reports to zookeeper
>>>>>>>> 5. zookeeper tells server A that it is no longer the leader and to sync with server B
>>>>>>>> 6. server A checks with server B but finds that server B's index version is different from its own
>>>>>>>> 7. server A begins replicating a new copy of the index from server B using the (legacy?) replication handler
>>>>>>>> 8. the original index on server A was not deleted, so it runs out of disk space mid-replication
>>>>>>>> 9. server A throws an error, deletes the partially replicated index, and then tries to replicate again
>>>>>>>>
>>>>>>>> At this point I think steps 6 => 9 will loop forever.
>>>>>>>>
>>>>>>>> If the actual errors from solr.log are useful let me know; I'm not including them now for brevity since this email is already pretty long. In a nutshell, and in order, on server A I can find the error that took it down, the post-recovery instruction from ZK to unregister itself as a leader, the corrupt index error message, and then the (start, run out of disk, stop) loop of the replication messages.
>>>>>>>>
>>>>>>>> I first want to ask: is what I described possible, or did I get lost somewhere along the way reading the docs? Is there any reason to think that solr should not do this?
>>>>>>>>
>>>>>>>> If my version of events is feasible I have a few other questions:
>>>>>>>>
>>>>>>>> 1. What happens to the docs that were indexed on server A but never replicated to server B before the failure? Assuming the replica on server A were to complete the recovery process, would those docs appear in the index or are they gone for good?
>>>>>>>>
>>>>>>>> 2. I am guessing that the corrupt replica on server A is not deleted because it is still viable: if server B had a catastrophic failure, you could pick up the pieces from server A. If so, is this a configurable option somewhere? I'd rather take my chances on server B going down before replication finishes than be stuck in this state and have to manually intervene. Besides, I have disaster recovery backups for exactly this situation.
>>>>>>>>
>>>>>>>> 3. Is there anything I can do to prevent this type of failure? It seems to me that if server B gets even 1 new document as a leader, the shard will enter this state. My only thought right now is to try to stop sending documents for indexing the instant a leader goes down, but on the surface this solution sounds tough to implement perfectly (and it would have to be perfect).
>>>>>>>>
>>>>>>>> If you got this far, thanks for sticking with me.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Brian
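On question 3, one way to approximate "stop indexing when a leader is down" is to have the indexer poll the Collections API and pause while any replica of the target collection is not active; it will never be perfect (there is always a window), but it narrows it. A rough sketch; the host and collection name are placeholders and the grep-based parsing is deliberately crude:

    # Count replicas that are currently down or recovering; pause indexing while this is non-zero.
    curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json" \
      | grep -o '"state":"down"\|"state":"recovering"\|"state":"recovery_failed"' | wc -l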