Thanks for all the details.
Every time I go back to that article I learn something new (or should I say I 
remember something I had forgotten!).

The scenario you are describing could match our experience, except for the last 
step, "you stop indexing entirely and the tlog never gets rotated".
It's very likely that we stopped indexing for a significant amount of time 
(maybe ~1 hour), but after that I'm quite sure there was still some indexing 
going on at a lower rate, and it didn't trigger anything.

Sorry if I'm not answering some of your questions; I'm trying to reproduce the 
issue and test a few things before giving definitive answers and applying your 
suggestions.

I can give you some additional info though:
- We are working with quite small documents (<5KB).
- On a node that is currently having the issue, I tried to force a commit 
(http://.../update?commit=true), see the request sketched below, and nothing 
happens from the tlog point of view: it is still very large and constantly 
growing.
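
For reference, the forced commit was roughly the following request (host and 
collection names here are taken from the logs below; the real URL has the 
same shape):

  curl "http://srv1:8083/solr/col_blue_shard1/update?commit=true"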

I set the log level to DEBUG through the Solr UI and grepped for "commit" 
around the time of the forced commit; here's what we have:
  "method": "processCommit",
  "message": "PRE_UPDATE 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
 {commit=true&update.chain=files-update-processor&df=_text_}",
  "method": "distribCommit",
  "message": "Distrib commit to: [] params: 
update.chain=files-update-processor&update.distrib=TOLEADER&commit_end_point=leaders&distrib.from=http://srv1:8083/solr/col_blue_shard1/";,
  "method": "distribCommit",
  "message": "Distrib commit to: [StdNode: 
http://srv2:8083/solr/col_blue_shard1/] params: 
update.chain=files-update-processor&update.distrib=FROMLEADER&commit_end_point=replicas&distrib.from=http://srv2:8083/solr/col_blue_shard1/";,
  "message": "sending update to http://srv2:8083/solr/col_blue_shard1/ retry:0 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
 
params:update.chain=files-update-processor&update.distrib=FROMLEADER&commit_end_point=replicas&distrib.from=http://srv1:8083/solr/col_blue_shard1/&commit=true&softCommit=false&waitSearcher=true&expungeDeletes=false&openSearcher=true";,
  "method": "doLocalCommit",
  "message": "Ignoring commit while not ACTIVE - state: BUFFERING replay: 
false",
  "message": "PRE_UPDATE FINISH 
{commit=true&update.chain=files-update-processor&df=_text_}",
  "message": "[col_blue_shard1]  webapp=/shoppingfeedses path=/update 
params={commit=true}{commit=} 0 119",
  "message": "Closing out SolrRequest: 
{commit=true&update.chain=files-update-processor&df=_text_}",

"Ignoring commit while not ACTIVE" feels strange but maybe It's doesn't mean 
what I think it means.
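
If it helps, I can also double-check what state the replicas are reported in 
via the Collections API, something along these lines (any of our nodes should 
do as the host):

  curl "http://srv1:8083/solr/admin/collections?action=CLUSTERSTATUS"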
Do you see something strange in there?


Kind regards,
Gaël


From: Erick Erickson <erickerick...@gmail.com>
Sent: Thursday, July 23, 2020 13:52
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Subject: Re: tlog keeps growing 
 
Hmmm, this doesn't account for the tlog growth, but a 15 minute hard
commit is excessive and accounts for your down time on restart if Solr
is forcefully shut down. I’d shorten it to a minute or less. You also
shouldn’t have any replay if you shut down your Solr gracefully.


Here’s lots of background:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

It is _vaguely_ possible (and I'm really reaching here) that this happens if 
you're indexing at a rapid rate when you start Solr. Here's the scenario:

- Solr un-gracefully shuts down
- you start Solr and it starts replaying the tlog
- more documents come in and are, guess what, written to the tlog
- the tlog grows as long as this goes on.
- you stop indexing entirely and the tlog never gets rotated.

I consider this unlikely-but-possible. One implication here is that it is _not_
a requirement that a replica be "active" to start accumulating docs in the
tlog; the tlog can certainly grow while a replica is recovering. There should
also be messages about starting tlog replay (although I don't remember the
exact wording).
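
Something like this should surface them, whatever the exact wording turns out 
to be (the log path is just an example, point it at wherever your Solr logs 
live):

  grep -iE "replay|tlog" /var/solr/logs/solr.log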

When you shut Solr down, do you get any message on the screen about 
"forcefully killing Solr"? If so, the tlog gets replayed on startup. That 
shouldn't happen, and if it is happening it'd be good to find out why...
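
For reference, a graceful stop is just the regular stop command, e.g. (port 
taken from the URLs in your logs, adjust as needed):

  bin/solr stop -p 8083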

Also note that the contract is that enough closed tlogs are kept around to
satisfy numRecordsToKeep. So in this scenario, your large tlog will remain
until enough _new_ tlogs are created to contain over 1,000 documents.

What none of this really explains is why the tlog goes away on restart.
Does it disappear immediately or does it disappear after a while (i.e.
after you index another 1,000 documents and it can be rolled off)?

BTW, numRecordsToKeep is really only useful for "peer sync". If a follower
needs to recover and there are enough records in the tlog to catch it up
from the point it started recovering, the records are sent from the tlog
rather than replicating the full index. If you're not getting replicas going
into recovery (except of course on startup), this isn't doing you much
good and I'd reduce it (the default is 100).

Yeah, the open searcher is the soft commit. Do you see any references to
"commit" or "commitTracker"? Some of those will be the autocommit kicking
in, and there should be info in the message about whether a new searcher
is being opened, which will indicate whether it's a hard or soft commit.

So I'd shorten your hard commit interval significantly. That shouldn't matter, 
or more correctly the large tlog should eventually clear itself out anyway, 
but it'll provide some data.
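
To make that concrete, something along these lines in your updateHandler (a 
one-minute hard commit and numRecordsToKeep back at the default; untested, 
adjust to your setup):

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <!-- back to the default; only really useful for peer sync as noted above -->
      <int name="numRecordsToKeep">100</int>
      <int name="maxNumLogsToKeep">100</int>
    </updateLog>
    <autoCommit>
      <!-- 1 minute instead of 15 -->
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>180000</maxTime>
    </autoSoftCommit>
  </updateHandler>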

One thing to check, assuming you're indexing docs: does a new tlog get created 
15+ minutes later (and after the replica is active)? If so, I'd expect the 
large tlog to be deleted 1,000 docs later.
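
An easy way to keep an eye on that is to watch the tlog directory on the 
affected node, e.g. (the path is just an example, point it at that core's 
data dir):

  watch -n 60 'ls -lh /var/solr/data/col_blue_shard1/data/tlog'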

Best,
Erick

> On Jul 22, 2020, at 4:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> I’m assuming you do not have CDCR configured, correct?
> 
> This is weird. Every hard commit should close the current tlog, open a new 
> one and delete old ones respecting numRecordsToKeep.
> 
> Are these NRT replicas or TLOG replicas? That shouldn’t make a lot of 
> difference, but might be a clue.
> 
> Your Solr log on the node with the 20G tlogs should show commits; is there 
> anything that stands out?
> 
> It’s also a bit weird that the numbers are so very different. While not 
> lock-step, I’d expect that they were reasonably close. When you restart the 
> server, does Solr roll over the logs for some period or does it just start 
> accumulating the tlog?
> 
> Are both replicas in the “active” state? And is the replica with the large 
> tlogs the follower or the leader?
> 
> Mainly asking a bunch of questions because I haven’t seen this happen; the 
> answers to the above might give a clue where to look next.
> 
> Best,
> Erick
> 
>> On Jul 22, 2020, at 3:39 PM, Gael Jourdan-Weil 
>> <gael.jourdan-w...@kelkoogroup.com> wrote:
>> 
>> Hello,
>> 
>> I'm facing a situation where a transaction log file keeps growing and is 
>> never deleted.
>> 
>> The setup is as follow:
>> - Solr 8.4.1
>> - SolrCloud with 2 nodes
>> - 1 collection, 1 shard
>> 
>> On one of the nodes I can see the tlog files having the expected behavior, 
>> that is, new tlog files being created and old ones being deleted at a 
>> frequency that matches the autocommit settings.
>> For instance, there are currently two files, tlog.0000000000000003226 and 
>> tlog.0000000000000003227, each of them around 1GB in size.
>> 
>> But on the other node, I see two files, tlog.0000000000000000298 and 
>> tlog.0000000000000000299, the latter now being 20GB and having been created 
>> 10 hours ago.
>> 
>> It has already happened a few times; restarting the server seems to set 
>> things right, but it's obviously not a durable solution.
>> 
>> Do you have any idea what could cause this behavior?
>> 
>> solrconfig.xml:
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <updateLog>
>>     <str name="dir">${solr.ulog.dir:}</str>
>>     <int name="numRecordsToKeep">1000</int>
>>     <int name="maxNumLogsToKeep">100</int>
>>   </updateLog>
>>   <autoCommit>
>>     <maxTime>900000</maxTime>
>>     <openSearcher>false</openSearcher>
>>   </autoCommit>
>>   <autoSoftCommit>
>>       <maxTime>180000</maxTime>
>>   </autoSoftCommit>
>> </updateHandler>
>> 
>> Kind regards,
>> Gaël
>> 
> 
