Right, I've managed to double the memory required by Solr by varying the _query_. Siiiggggh.

There are some JIRAs out there (don't have them readily available, sorry) that short-circuit queries that take "too long", and there are some others to short-circuit "expensive" queries. I believe the latter has been discussed but not committed.

Best,
Erick
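Related to the "too long" case: the stock timeAllowed parameter already caps the search phase of a request, though it is not a complete guard (not every phase of query processing is covered, and results may come back flagged as partial), which is roughly the gap the JIRAs above address. A minimal sketch, with the host and collection name as placeholders:

    # Cap the search phase at 5 seconds; the response may be flagged partialResults=true.
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=10&timeAllowed=5000"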
On Thu, Nov 19, 2015 at 12:21 PM, Brian Scholl <bsch...@legendary.com> wrote:
> Primarily our outages are caused by Java crashes or really long GC pauses; in short, not all of our developers have a good sense of what types of queries are unsafe if abused (for example, cursorMark or start=).
>
> Honestly, stability of the JVM is another task I have coming up. I agree that recovery should be uncommon, we're just not where we need to be yet.
>
> Cheers,
> Brian
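For reference, the deep-paging pattern mentioned above: cursorMark pages with a stable sort instead of large start= offsets, which keeps each request cheap but still lets a client walk the entire index if nothing reins it in. A hedged sketch, with the collection name and sort field assumed (the sort must end with the uniqueKey field):

    # First page: cursorMark=* ; subsequent pages resend the same query with the
    # nextCursorMark value returned by the previous response (placeholder below).
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=100&sort=id+asc&cursorMark=*"
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=100&sort=id+asc&cursorMark=<nextCursorMark>"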
>> On Nov 19, 2015, at 15:14, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> bq: I would still like to increase the number of transaction logs retained so that shard recovery (outside of long term failures) is faster than replicating the entire shard from the leader
>>
>> That's legitimate, but (you knew that was coming!) nodes having to recover _should_ be a rare event. Is this happening often, or is it a result of testing? If nodes are going into recovery for no good reason (i.e. network being unplugged, whatever) I'd put some energy into understanding that as well. Perhaps there are operational-type things that should be addressed (e.g. stop indexing, wait for commit, _then_ bounce Solr instances)...
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 19, 2015 at 10:17 AM, Brian Scholl <bsch...@legendary.com> wrote:
>>> Hey Erick,
>>>
>>> Thanks for the reply.
>>>
>>> I plan on rebuilding my cluster soon with more nodes so that the index size (including tlogs) is under 50% of the available disk at a minimum; ideally we will shoot for under 33%, budget permitting. I think I now understand the problem that managing this resource will solve, and I appreciate your (and Shawn's) feedback.
>>>
>>> I would still like to increase the number of transaction logs retained so that shard recovery (outside of long-term failures) is faster than replicating the entire shard from the leader. I understand that this is an optimization and not a solution for replication. If I'm being thick about this, call me out :)
>>>
>>> Cheers,
>>> Brian
>>>
>>>> On Nov 19, 2015, at 11:30, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> First, every time you autocommit there _should_ be a new tlog created. A hard commit truncates the tlog by design.
>>>>
>>>> My guess (not based on knowing the code) is that Real Time Get needs file handles open to the tlog files, and you'll have a bunch of them. Lots and lots and lots. Thus the "too many file handles" problem is just waiting out there for you.
>>>>
>>>> However, this entire approach is, IMO, not going to solve anything for you. Or rather, other problems will come out of the woodwork.
>>>>
>>>> To wit: at some point, you _will_ need to have at least as much free space on your disk as your current index occupies, even without recovery. Background merging of segments can effectively do the same thing as an optimize step, which rewrites the entire index to new segments before deleting the old segments. So far you haven't hit that situation in steady state, but you will.
>>>>
>>>> Simply put, I think you're wasting your time pursuing the tlog option. You must have bigger disks or smaller indexes, such that there is at least as much free disk space at all times as your index occupies. In fact, if the tlogs are on the same drive as your index, the tlog option you're pursuing is making the situation _worse_ by making running out of disk space during a merge even more likely.
>>>>
>>>> So unless there's a compelling reason you can't use bigger disks, IMO you'll waste lots and lots of valuable engineering time before... buying bigger disks.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl <bsch...@legendary.com> wrote:
>>>>> I have opted to modify the number and size of transaction logs that I keep to resolve the original issue I described. In so doing I think I have created a new problem; feedback is appreciated.
>>>>>
>>>>> Here are the new updateLog settings:
>>>>>
>>>>> <updateLog>
>>>>>   <str name="dir">${solr.ulog.dir:}</str>
>>>>>   <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>>>>>   <int name="numRecordsToKeep">10000000</int>
>>>>>   <int name="maxNumLogsToKeep">5760</int>
>>>>> </updateLog>
>>>>>
>>>>> First I want to make sure I understand what these settings do:
>>>>> numRecordsToKeep: per transaction log file, keep this number of documents
>>>>> maxNumLogsToKeep: retain this number of transaction log files in total
>>>>>
>>>>> During my testing I thought I observed that a new tlog is created every time auto-commit is triggered (every 15 seconds in my case), so I set maxNumLogsToKeep high enough to contain an entire day's worth of updates. Knowing that I could potentially need to bulk load some data, I set numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>>>>>
>>>>> The problem I think this has created is that I am now running out of file descriptors on the servers. After indexing new documents for a couple of hours, some servers (not all) will start logging this error rapidly:
>>>>>
>>>>> 73021439 WARN (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983}) [   ] o.e.j.s.ServerConnector
>>>>> java.io.IOException: Too many open files
>>>>>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>>>>>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>>>>>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>>>>>         at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>>>>>         at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
>>>>>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>>>>>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> The output of ulimit -n for the user running the solr process is 1024. I am pretty sure I can prevent this error from occurring by increasing the limit on each server, but it isn't clear to me how high it should be or whether raising the limit will cause new problems.
>>>>>
>>>>> Any advice you could provide in this situation would be awesome!
>>>>>
>>>>> Cheers,
>>>>> Brian
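On the file-descriptor question: 1024 is the stock per-user soft limit on Ubuntu and is generally too low for a Solr node that holds many tlogs plus index segments plus sockets open at once. A sketch of how to inspect and raise the limit; the solr user name, the PID-discovery pattern, and the 65536 value are assumptions to adjust for your own hosts:

    # What the running Solr JVM actually got, and how many descriptors it is using right now
    SOLR_PID=$(pgrep -f start.jar | head -n 1)    # adjust the pattern for how Solr is launched
    grep "Max open files" /proc/$SOLR_PID/limits
    ls /proc/$SOLR_PID/fd | wc -l

    # Raise the per-user limit, e.g. in /etc/security/limits.conf, then restart Solr so the
    # new limit is picked up (values are illustrative, not a recommendation for your load):
    #   solr  soft  nofile  65536
    #   solr  hard  nofile  65536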
>>>>>> On Oct 27, 2015, at 20:50, Jeff Wartes <jwar...@whitepages.com> wrote:
>>>>>>
>>>>>> On the face of it, your scenario seems plausible. I can offer two pieces of info that may or may not help you:
>>>>>>
>>>>>> 1. A write request to Solr will not be acknowledged until an attempt has been made to write to all relevant replicas. So, B won't ever be missing updates that were applied to A, unless communication with B was disrupted somehow at the time of the update request. You can add a min_rf param to your write request, in which case the response will tell you how many replicas received the update, but it's still up to your indexer client to decide what to do if that's less than your replication factor.
>>>>>>
>>>>>> See https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance for more info.
>>>>>>
>>>>>> 2. There are two forms of replication. The usual thing is for the leader of each shard to write an update to all replicas before acknowledging the write itself, as above. If a replica is less than N docs behind the leader, the leader can replay those docs to the replica from its transaction log. If a replica is more than N docs behind, though, it falls back to the replication handler recovery mode you mention and attempts to re-sync the whole shard from the leader.
>>>>>>
>>>>>> The default N for this is 100, which is pretty low for a high-update-rate index. It can be changed by increasing the size of the transaction log (via numRecordsToKeep), but be aware that a large transaction log size can delay node restart.
>>>>>>
>>>>>> See https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-TransactionLog for more info.
>>>>>>
>>>>>> Hope some of that helps; I don't know a way to say delete-first-on-recovery.
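A hedged sketch of the min_rf idea above; the host, collection name, and document fields are placeholders, and note that min_rf is informational only — Solr will not fail the write for you, so the indexer has to inspect the achieved replication factor reported in the response:

    # Ask Solr to report how many replicas received this update (target rf of 2 here)
    curl "http://localhost:8983/solr/mycollection/update?min_rf=2" \
         -H "Content-Type: application/json" \
         -d '[{"id":"doc-1","title":"example"}]'
    # Compare the achieved rf in the response against your replication factor and
    # retry or alert from the indexer if it comes back lower.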
>>>>>> On 10/27/15, 5:21 PM, "Brian Scholl" <bsch...@legendary.com> wrote:
>>>>>>
>>>>>>> Whoops, in the description of my setup that should say 2 replicas per shard. Every server has a replica.
>>>>>>>
>>>>>>>> On Oct 27, 2015, at 20:16, Brian Scholl <bsch...@legendary.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am experiencing a failure mode where a replica is unable to recover, and it will try to do so forever. In writing this email I want to make sure that I haven't missed anything obvious or a configurable option that could help. If something about this looks funny, I would really like to hear from you.
>>>>>>>>
>>>>>>>> Relevant details:
>>>>>>>> - solr 5.3.1
>>>>>>>> - java 1.8
>>>>>>>> - ubuntu linux 14.04 lts
>>>>>>>> - the cluster is composed of 1 SolrCloud collection with 100 shards backed by a 3-node zookeeper ensemble
>>>>>>>> - there are 200 solr servers in the cluster, 1 replica per shard
>>>>>>>> - a shard replica is larger than 50% of the available disk
>>>>>>>> - ~40M docs added per day, total indexing time is 8-10 hours spread over the day
>>>>>>>> - autoCommit is set to 15s
>>>>>>>> - softCommit is not defined
>>>>>>>>
>>>>>>>> I think I have traced the failure to the following set of events but would appreciate feedback:
>>>>>>>>
>>>>>>>> 1. new documents are being indexed
>>>>>>>> 2. the leader of a shard, server A, fails for any reason (java crashes, times out with zookeeper, etc.)
>>>>>>>> 3. zookeeper promotes the other replica of the shard, server B, to the leader position and indexing resumes
>>>>>>>> 4. server A comes back online (typically tens of seconds later) and reports to zookeeper
>>>>>>>> 5. zookeeper tells server A that it is no longer the leader and to sync with server B
>>>>>>>> 6. server A checks with server B but finds that server B's index version is different from its own
>>>>>>>> 7. server A begins replicating a new copy of the index from server B using the (legacy?) replication handler
>>>>>>>> 8. the original index on server A was not deleted, so it runs out of disk space mid-replication
>>>>>>>> 9. server A throws an error, deletes the partially replicated index, and then tries to replicate again
>>>>>>>>
>>>>>>>> At this point I think steps 6 => 9 will loop forever.
>>>>>>>>
>>>>>>>> If the actual errors from solr.log are useful let me know; I'm not including them now for brevity since this email is already pretty long. In a nutshell, and in order, on server A I can find the error that took it down, the post-recovery instruction from ZK to unregister itself as a leader, the corrupt index error message, and then the (start, run out of disk, stop) loop of the replication messages.
>>>>>>>>
>>>>>>>> I first want to ask: is what I described possible, or did I get lost somewhere along the way reading the docs? Is there any reason to think that solr should not do this?
>>>>>>>>
>>>>>>>> If my version of events is feasible I have a few other questions:
>>>>>>>>
>>>>>>>> 1. What happens to the docs that were indexed on server A but never replicated to server B before the failure? Assuming the replica on server A were to complete the recovery process, would those docs appear in the index or are they gone for good?
>>>>>>>>
>>>>>>>> 2. I am guessing that the corrupt replica on server A is not deleted because it is still viable: if server B had a catastrophic failure, you could pick up the pieces from server A. If so, is this a configurable option somewhere? I'd rather take my chances on server B going down before replication finishes than be stuck in this state and have to manually intervene. Besides, I have disaster recovery backups for exactly this situation.
>>>>>>>>
>>>>>>>> 3. Is there anything I can do to prevent this type of failure? It seems to me that if server B gets even 1 new document as a leader, the shard will enter this state. My only thought right now is to try to stop sending documents for indexing the instant a leader goes down, but on the surface this solution sounds tough to implement perfectly (and it would have to be perfect).
>>>>>>>>
>>>>>>>> If you got this far, thanks for sticking with me.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Brian
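On question 3, one way to approximate "stop indexing when a leader is down" is to have the indexer poll the Collections API and pause while any replica of the target collection is not active; it will never be perfect (there is always a window), but it narrows it. A rough sketch; the host and collection name are placeholders and the grep-based parsing is deliberately crude:

    # Count replicas that are currently down or recovering; pause indexing while this is non-zero.
    curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json" \
      | grep -o '"state":"down"\|"state":"recovering"\|"state":"recovery_failed"' | wc -l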