No hard numbers, but the general guidance is that you should set your hard commit interval to match your expectations for how quickly nodes should come up if they need to be restarted. Specifically, a hard commit ensures that all changes have been committed to disk and are ready for immediate access on restart, but any soft-commit changes since the last hard commit must be "replayed" (re-executed) on restart of a node.

How long does it take to replay the changes in the update log? No firm numbers, but treat it as if all of those uncommitted updates had to be resent and reprocessed by Solr. It's probably faster than that, but you get the picture.

I would suggest thinking in terms of minutes rather than seconds for hard commits: 5, 10, 15, 20, or 30 minutes.

Hard commits may kick off segment merges, so too rapid a rate of segment creation might cause problems, or at least be counterproductive.

So, instead of 15 seconds, try 15 minutes.
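In solrconfig.xml that might look something like the following (just a sketch with round numbers, not a tuned recommendation; keep your 1s soft commit for visibility and adjust the hard commit to your own tolerance for replay time on restart):

    <autoCommit>
      <maxTime>900000</maxTime>            <!-- hard commit every 15 minutes (in ms) -->
      <openSearcher>false</openSearcher>   <!-- don't open a new searcher on hard commit -->
    </autoCommit>

    <autoSoftCommit>
      <maxTime>1000</maxTime>              <!-- short soft commits keep new docs visible -->
    </autoSoftCommit>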

OTOH, if you really need to handle 4,000 updates a second... you are clearly in "uncharted territory" and should expect to do some heavy-duty trial-and-error tuning on your own.

-- Jack Krupansky

-----Original Message----- From: Tim Vaillancourt
Sent: Saturday, July 27, 2013 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true
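
That's roughly this in solrconfig.xml (paraphrasing, not a copy-paste of my config):

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>1000</maxTime>   <!-- soft commits always open a new searcher -->
    </autoSoftCommit>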

The 15sec hard commit was sort of a guess; I could try a smaller number.
When you say "getting too large", what limit do you think it would be
hitting: a ulimit (nofiles), disk space, number of changes, or a limit
in Solr itself?

By my math there would be 15 tlogs max per core, but I don't really know
how it all works; I'd appreciate it if someone could fill me in or point
me somewhere.

Cheers,

Tim

On 27/07/13 07:57 AM, Erick Erickson wrote:
What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever you do
a hard commit (autocommit), with openSearcher either true or false; it
doesn't matter.
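For reference, the tlog is configured by the updateLog stanza in
solrconfig.xml, right next to the autocommit settings; something like the
stock example below (values are just illustrative):

    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>   <!-- where the tlog files live -->
    </updateLog>

    <autoCommit>
      <maxTime>15000</maxTime>                  <!-- hard commit; truncates the tlog -->
      <openSearcher>false</openSearcher>
    </autoCommit>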

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt<t...@elementspace.com> wrote:
Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not
appearing after a previous crash. Every time I run this high-volume test
that produced my stack trace, I zero out the logs, Solr data, and ZooKeeper
data, and start over from scratch with a brand-new collection.

The test is mostly high volume (2000-4000 updates/sec), and at the start the SolrCloud runs decently for a good 20-60 minutes with no errors in the logs at all. Then that stack trace occurs on all 3 nodes (staggered), I immediately get some replica-down messages, and then "cannot connect" errors to all the other cluster nodes, which have all crashed the same way. The tlog error could
be a symptom of the problem of running out of threads, perhaps.

Shawn: thanks so much for sharing those details! Yes, they seem to be nice
servers, for sure - I don't get to touch/see them but they're fast! I'll
look into firmware updates for sure and will try again after applying them. These Solr instances are not bare metal; they're actually KVM VMs, so that's another
layer to look into, although it is consistent between the two clusters.

I am not currently raising the 'nofiles' ulimit above the default like you are, but does Solr use 10,000+ file handles? It won't hurt to try it, I guess
:). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an
experiment.

Thanks!

Tim


On 25/07/13 05:55 PM, Yonik Seeley wrote:
On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt<t...@elementspace.com>
wrote:
"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)

That itself isn't necessarily a problem (and why it says "non fatal")
- it just means that most likely a transaction log file was
truncated by a previous crash.  It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com
