Thanks Jack/Erick,

I don't know if this is true or not, but I've read there is a tlog per soft commit, which is then truncated by the hard commit. If that were true, a 15sec hard commit with a 1sec soft commit could generate around 15 tlogs, but I've never checked. I like Erick's scenario more if it is 1 tlog/core, though. I'll try to find out some more.


Another couple of tests/things I really should try for sanity (a rough sketch of the last two is below):
- Java 1.6 and Jetty 8: just to rule things out (I wouldn't actually launch this way).
- ulimit for 'nofiles': the default is pretty high, but why not?
- Monitor the size and number of tlogs.
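This is roughly what I plan to run for those last two. The /var/solr/cores path is made up for the example (the tlogs should be under each core's data/tlog in the default layout), and the pgrep assumes Solr is running as a single JVM via Jetty's start.jar:

  # default nofiles limit on these boxes
  ulimit -n

  # watch tlog file count and total size per core while the indexing test runs
  watch -n 5 "find /var/solr/cores/*/data/tlog -type f | wc -l; du -sh /var/solr/cores/*/data/tlog"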


I'll be sure to share findings, and I really appreciate the help, guys!


PS: This is asking a lot, but if anyone can take a look at that thread dump, or give me some pointers on what to look for in a stall/thread pile-up dump like this, I would really appreciate it. I'm quite weak at deciphering those (I use Thread Dump Analyzer), but I'm sure it would tell a lot.
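For anyone curious, this is how I've been capturing the dumps and doing a first-pass summary (jstack ships with the JDK; the pgrep again assumes a single Jetty/start.jar JVM on the box):

  # grab a few dumps ~10s apart so stuck threads stand out across samples
  SOLR_PID=$(pgrep -f start.jar)
  for i in 1 2 3; do jstack -l "$SOLR_PID" > "threaddump.$i.txt"; sleep 10; done

  # quick summary: count of threads by state
  grep 'java.lang.Thread.State' threaddump.1.txt | sort | uniq -c | sort -rn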


Cheers,


Tim


On 27/07/13 02:24 PM, Erick Erickson wrote:
Tim:

15 seconds isn't unreasonable; I was mostly wondering if it was hours.

Take a look at the size of the tlogs as you're indexing; you should see them
truncate every 15 seconds or so. There'll be a varying number of tlogs kept
around, although under heavy indexing I'd only expect 1 or 2 inactive ones;
internally, enough tlogs are kept around to hold the last 100 docs.

There should only be 1 open tlog/core as I understand it. When a commit
happens (hard, openSearcher = true or false doesn't matter), the current
tlog is closed and a new one opened. Then some cleanup happens so that only
enough tlogs are kept around to hold 100 docs.
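One way to sanity-check that on a live node is to look at which tlog files the JVM actually holds open, something like this (find the Solr pid however you normally do; the pgrep here assumes a single Jetty/start.jar process):

  # tlog files the Solr JVM currently has open -- you'd expect one per core
  lsof -p $(pgrep -f start.jar) | grep '/tlog/'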

Strange, I'm kind of out of ideas.
Erick

On Sat, Jul 27, 2013 at 4:41 PM, Jack Krupansky <j...@basetechnology.com> wrote:
No hard numbers, but the general guidance is that you should set your hard
commit interval to match your expectations for how quickly nodes should come
up if they need to be restarted. Specifically, a hard commit assures that
all changes have been committed to disk and are ready for immediate access
on restart, but any and all updates since the last hard commit must be
"replayed" (re-executed) from the transaction log when a node restarts.

How long does it take to replay the changes in the update log? No firm
numbers, but treat it as if all of those uncommitted updates had to be
resent and reprocessed by Solr. It's probably faster than that, but you get
the picture.
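To put a rough number on it: at the 4,000 updates a second you mention, a 15-minute hard commit window is on the order of 3 to 4 million uncommitted updates that could need replaying after a crash, which is exactly why the restart-time expectation should drive the interval you pick.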

I would suggest thinking in terms of minutes rather than seconds for hard
commits: 5, 10, 15, 20, or 30 minutes.

Hard commits may result in kicking off segment merges, so too rapid a rate
of segment creation might cause problems or at least be counterproductive.

So, instead of 15 seconds, try 15 minutes.

OTOH, if you really need to handle 4,000 updates a second... you are clearly
in "uncharted territory" and should expect to do some heavy-duty
trial-and-error tuning on your own.

-- Jack Krupansky

-----Original Message----- From: Tim Vaillancourt
Sent: Saturday, July 27, 2013 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.3.1 - "Failure to open existing log file (non
fatal)" errors under high load


Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true

The 15sec hard commit was sort of a guess; I could try a smaller number.
When you say "getting too large", what limit do you think it would be
hitting: a ulimit (nofiles), disk space, number of changes, or a limit in
Solr itself?

By my math there would be 15 tlogs max per core, but I don't really know
how it all works; I'd appreciate it if someone could fill me in or point me somewhere.

Cheers,

Tim

On 27/07/13 07:57 AM, Erick Erickson wrote:
What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever
you do a hard commit (autocommit), with openSearcher either
true or false; it doesn't matter.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not
appearing after a previous crash. Every time I run this high-volume test
that produced my stack trace, I zero out the logs, Solr data and Zookeeper
data, and start over from scratch with a brand new collection.

The test is mostly high volume (2000-4000 updates/sec), and at the start the
SolrCloud runs decently for a good 20-60 minutes with no errors in the logs
at all. Then that stack trace occurs on all 3 nodes (staggered), I
immediately get some replica-down messages, and then "cannot connect" errors
to all the other cluster nodes, which have all crashed the same way. The
tlog error could perhaps be a symptom of the real problem of running out of
threads.

Shawn: thanks so much for sharing those details! Yes, they seem to be nice
servers, for sure; I don't get to touch/see them but they're fast! I'll
look into firmwares for sure and will try again after updating them. These
Solr instances are not bare metal; they're actually KVM VMs, so that's
another layer to look into, although it is consistent between the two
clusters.

I am not currently increasing the 'nofiles' ulimit above the default like
you are, but does Solr use 10,000+ file handles? It won't hurt to try it, I
guess :). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6
as an experiment.
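If nothing else, it should be easy to see how many handles the Solr JVM is actually using while the test runs; this is the sort of check I have in mind (pid discovery again assumes a single Jetty/start.jar JVM, and it needs to run as the Solr user or root):

  # how many file descriptors the Solr JVM has open right now
  SOLR_PID=$(pgrep -f start.jar)
  ls /proc/$SOLR_PID/fd | wc -l

  # and the nofiles limit it is actually running with
  grep -i 'open files' /proc/$SOLR_PID/limits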

Thanks!

Tim


On 25/07/13 05:55 PM, Yonik Seeley wrote:
On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)

That itself isn't necessarily a problem (and why it says "non fatal");
it just means that most likely a transaction log file was
truncated from a previous crash. It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com
