Hmmm, this doesn't account for the tlog growth, but a 15 minute hard commit is excessive and accounts for your down time on restart if Solr is forcefully shut down. I’d shorten it to a minute or less. You also shouldn’t have any replay if you shut down your Solr gracefully.
Here’s lots of background: https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ It is _vaguely_ possible (and I’m really reaching here) that if you’re indexing at a rapid rate when you start Solr. Here’s the scenario: - Solr un-gracefully shuts down - you start Solr and it starts replaying the tlog - more documents come in and are, guess what, written to the tlog - the tlog grows as long as this goes on. - you stop indexing entirely and the tlog never gets rotated. I consider this unlikely-but-possible. One implication here is that it is _not_ a requirement that a replica be “active” to start accumulating docs in the tlog, they certainly can grow when a replica is recovering. There should also be messages about starting tlog replay (although I don’t remember the exact wording). When you shut Solr down, do you get any message on the screen about “forcefully killing Solr”? If so, the tlog gets replayed on startup. This shouldn’t happen and if it is it’d be good to find out why... Also note that the contract is that enough closed tlogs are kept around to satisfy numRecordsToKeep. So in this scenario, your large tlog will remain until enough _new_ tlogs are created to contain over 1,000 documents. What none of this really explains is why the tlog goes away on restart. Does it disappear immediately or does it disappear after a while (i.e. after you index another 1,000 documents and it can be rolled off)? BTW, numRecordsToKeep really only is useful for “peer sync”. If a follower needs to recover and there are enough records in the tlog to catch it up from when it started to recover, the records are sent from the tlog rather than the full index replicating. If you’re not getting replicas going into recovery (except of course on startup), this isn’t doing you much good and I’d reduce it (the default is 100). Yeah, the open searcher is the soft commit. Do you see any references to “commit” or “commitTracker”? Some of those will be the autocommit kicking in, and there should be info in the message about whether a new searcher is being opened, which will indicat whether it;s a hard or soft commit. So I’d shorten your hard commit interval significantly. That shouldn’t matter, or more correctly the large tlog should eventually clear itself out but it’ll provide some data. One thing to check is if, and assuming you’re indexing docs, does a new tlog get created 15+ minutes later (and after the replica is active)? If so, I’d expect the large tlog to be deleted 1,000 docs later. Best, Erick > On Jul 22, 2020, at 4:14 PM, Erick Erickson <erickerick...@gmail.com> wrote: > > I’m assuming you do not have CDCR configured, correct? > > This is weird. Every hard commit should close the current tlog, open a new > one and delete old ones respecting numRecordsToKeep. > > Are these NRT replicas or TLOG replicas? That shouldn’t make a lot of > difference, but might be a clue. > > Your solr log in the one with 20G tlogs should show commits, is there > anything that points up? > > It’s also a bit weird that the numbers are so very different. While not > lock-step, I’d expect that they were reasonably close. When you restart the > server, does Solr roll over the logs for some period or does it just start > accumulating the tlog? > > Are both replicas in the “active” state? And is the replica with the large > tlogs the follower or the leader? > > Mainly asking a bunch of questions because I haven’t seen this happen, the > answers to the above might give a clue where to look next. > > Best, > Erick > >> On Jul 22, 2020, at 3:39 PM, Gael Jourdan-Weil >> <gael.jourdan-w...@kelkoogroup.com> wrote: >> >> Hello, >> >> I'm facing a situation where a transaction log file keeps growing and is >> never deleted. >> >> The setup is as follow: >> - Solr 8.4.1 >> - SolrCloud with 2 nodes >> - 1 collection, 1 shard >> >> On one of the node I can see the tlog files having the expected behavior, >> that is new tlog files being created and old ones being deleted at a >> frequency that matches the autocommit settings. >> For instance, there is currently two files tlog.0000000000000003226 and >> tlog.0000000000000003227, each of them is around 1G (size). >> >> But on the other node, I see two files tlog.0000000000000000298 and >> tlog.0000000000000000299, the later being now 20G and has been created 10 >> hours ago. >> >> It already happened a few times, restarting the server seems to make things >> go right but it's obviously not a durable solution. >> >> Do you have any idea what could cause this behavior? >> >> solrconfig.xml: >> <updateHandler class="solr.DirectUpdateHandler2"> >> <updateLog> >> <str name="dir">${solr.ulog.dir:}</str> >> <int name="numRecordsToKeep">1000</int> >> <int name="maxNumLogsToKeep">100</int> >> </updateLog> >> <autoCommit> >> <maxTime>900000</maxTime> >> <openSearcher>false</openSearcher> >> </autoCommit> >> <autoSoftCommit> >> <maxTime>180000</maxTime> >> </autoSoftCommit> >> </updateHandler> >> >> Kind regards, >> Gaël >> >