Tim: 15 seconds isn't unreasonable; I was mostly wondering if it was hours.
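For reference, a 15 second hard commit with openSearcher=false and a 1 second soft
commit would normally look something like this in solrconfig.xml, under
updateHandler (just a sketch; adjust to match your actual config):

    <autoCommit>
      <maxTime>15000</maxTime>           <!-- hard commit every 15 seconds -->
      <openSearcher>false</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>1000</maxTime>            <!-- soft commit every 1 second -->
    </autoSoftCommit>

Jack's suggestion below would just mean bumping the autoCommit maxTime to
something like 900000 (15 minutes).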
Take a look at the size of the tlogs as you're indexing; you should see them
truncate every 15 seconds or so. There'll be a varying number of tlogs kept
around, although under heavy indexing I'd only expect 1 or 2 inactive ones;
the internal rule is that enough tlogs are kept around to hold 100 docs. There
should only be 1 open tlog per core, as I understand it. When a commit happens
(hard, openSearcher = true or false doesn't matter), the current tlog is closed
and a new one opened. Then some cleanup happens so that only enough tlogs are
kept around to hold 100 docs.

Strange, I'm kind of out of ideas.

Erick

On Sat, Jul 27, 2013 at 4:41 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> No hard numbers, but the general guidance is that you should set your hard
> commit interval to match your expectations for how quickly nodes should come
> up if they need to be restarted. Specifically, a hard commit assures that
> all changes have been committed to disk and are ready for immediate access
> on restart, but any and all soft commit changes since the last hard commit
> must be "replayed" (re-executed) on restart of a node.
>
> How long does it take to replay the changes in the update log? No firm
> numbers, but treat it as if all of those uncommitted updates had to be
> resent and reprocessed by Solr. It's probably faster than that, but you get
> the picture.
>
> I would suggest thinking in terms of minutes rather than seconds for hard
> commits: 5, 10, 15, 20, 30 minutes.
>
> Hard commits may result in kicking off segment merges, so too rapid a rate
> of segment creation might cause problems or at least be counterproductive.
>
> So, instead of 15 seconds, try 15 minutes.
>
> OTOH, if you really need to handle 4,000 updates a second... you are clearly
> in "uncharted territory" and need to expect to do some heavy-duty trial and
> error tuning on your own.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Tim Vaillancourt
> Sent: Saturday, July 27, 2013 4:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.3.1 - "Failure to open existing log file (non
> fatal)" errors under high load
>
> Thanks for the reply Erick,
>
> Hard Commit - 15000ms, openSearcher=false
> Soft Commit - 1000ms, openSearcher=true
>
> 15sec hard commit was sort of a guess, I could try a smaller number.
> When you say "getting too large", what limit do you think it would be
> hitting: a ulimit (nofiles), disk space, number of changes, a limit in
> Solr itself?
>
> By my math there would be 15 tlogs max per core, but I don't really know
> how it all works; it would be great if someone could fill me in/point me
> somewhere.
>
> Cheers,
>
> Tim
>
> On 27/07/13 07:57 AM, Erick Erickson wrote:
>>
>> What is your autocommit limit? Is it possible that your transaction
>> logs are simply getting too large? tlogs are truncated whenever
>> you do a hard commit (autocommit) with openSearcher either
>> true or false, it doesn't matter...
>>
>> FWIW,
>> Erick
>>
>> On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt <t...@elementspace.com>
>> wrote:
>>>
>>> Thanks Shawn and Yonik!
>>>
>>> Yonik: I noticed this error appears to be fairly trivial, but it is not
>>> appearing after a previous crash. Every time I run this high-volume test
>>> that produced my stack trace, I zero out the logs, Solr data and
>>> Zookeeper data and start over from scratch with a brand new collection
>>> and zero'd out logs.
>>>
>>> The test is mostly high volume (2000-4000 updates/sec), and at the start
>>> the SolrCloud runs decently for a good 20-60 minutes with no errors in
>>> the logs at all. Then that stack trace occurs on all 3 nodes (staggered),
>>> I immediately get some replica-down messages, and then some "cannot
>>> connect" errors to all other cluster nodes, which have all crashed the
>>> same way. The tlog error could be a symptom of the problem of running out
>>> of threads, perhaps.
>>>
>>> Shawn: thanks so much for sharing those details! Yes, they seem to be
>>> nice servers, for sure - I don't get to touch/see them but they're fast!
>>> I'll look into firmwares for sure and will try again after updating them.
>>> These Solr instances are not bare metal and are actually KVM VMs, so
>>> that's another layer to look into, although it is consistent between the
>>> two clusters.
>>>
>>> I am not currently increasing the 'nofiles' ulimit above the default like
>>> you are, but does Solr use 10,000+ file handles? It won't hurt to try it,
>>> I guess :). To rule out Java 7, I'll probably also try Jetty 8 and Java
>>> 1.6 as an experiment as well.
>>>
>>> Thanks!
>>>
>>> Tim
>>>
>>>
>>> On 25/07/13 05:55 PM, Yonik Seeley wrote:
>>>>
>>>> On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt <t...@elementspace.com>
>>>> wrote:
>>>>>
>>>>> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
>>>>> Failure to open existing log file (non fatal)
>>>>>
>>>> That itself isn't necessarily a problem (and why it says "non fatal");
>>>> it just means that most likely a transaction log file was truncated
>>>> from a previous crash. It may be unrelated to the other issues you are
>>>> seeing.
>>>>
>>>> -Yonik
>>>> http://lucidworks.com
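P.S. On the 'nofiles' ulimit question above: a quick way to see what Solr is
actually using is to count the open file descriptors of the running Solr JVM
(a sketch for Linux; the PID placeholder and the 65535 value are just examples):

    # current limits for the shell/user that starts Solr
    ulimit -n        # soft limit
    ulimit -Hn       # hard limit

    # count file descriptors held by the Solr JVM (replace <solr-pid>)
    ls /proc/<solr-pid>/fd | wc -l

    # raise the limit persistently in /etc/security/limits.conf, e.g.:
    #   solr  soft  nofile  65535
    #   solr  hard  nofile  65535

Between index segment files, tlogs, and SolrCloud inter-node connections the
count can climb under heavy indexing, so raising the limit is cheap insurance.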