Hello folks,

We see similar behavior from time to time. The main difference seems to be that you see it while using NRT replication and we see it while using TLOG replication.

* Solr 7.5.0.
* 1 collection with 12 shards, each with 2 TLOG and 2 PULL replicas.
* 12 machines, each machine hosting one node/JVM. Each node contains 4 replicas (different shards).
* No explicit commits in the update requests. AutoCommit=15s, AutoSoftCommit=1s.

The symptoms we observe are as follows:

* It's on a TLOG replica that is not currently the leader.
* For that replica, there is a single transaction log that keeps on growing.
* For that replica, new segments are not being fetched from that shard's TLOG leader.

In this configuration, one node contains four TLOG cores. We have observed the problem occurring on a single one of the cores as well as on multiple cores in one node.

Anecdotally, it seems to occur more frequently on those collections that are large (number of documents, size on disk) and that have a higher ingest rate. These are vague terms and I don't know that I'm allowed to share specifics, but I can say that we run a number of different clouds with a similar setup and that this problem occurs more frequently for the more loaded clouds.

Initially we couldn't tell that this was occurring (queries were directed to PULL replicas so not evident to the applications, the TLOG cores with this problem reported as active and in a good state so nothing obviously wrong). Our early alarm system now consists of checking for large transaction logs. When we see this, we restart the problem node. Upon restart it recovers from its leader (fetching whatever segments it had missed - hours, days, ...). Eventually the large transaction log disappears and that core starts to cycle through a series of smaller transaction logs (the normal behavior).

We noticed that the ctime of the large transaction log seemed to be slightly later than that node's restart time. After that discovery, we saw the following pattern every time we observed the problem:

* Everything is in a good state at the beginning.
  A is a node containing a TLOG replica that is leader for its shard and B is a node containing a TLOG replica that is not a leader. Ingest is ongoing.
* A is stopped for a short period of time (< 30s) and then is started up again. If it makes a difference, our way of stopping this relies on systemd's default behavior - send SIGTERM, wait for 5s, send SIGKILL.
* The TLOG replica in B emerges from this as the leader for its shard. Everything about B appears to suggest B is operating correctly. The TLOG replica in A has the ever-growing transaction log and never fetches new segments.
* The malfunctioning TLOG replica in A can be "fixed" by restarting A.
* As noted earlier, this can affect cores (in a single node) individually. It can be a problem for one and not the others (or for all of the cores in a node).

It was suggested to us that this might be https://issues.apache.org/jira/browse/SOLR-13486.

On Wed, Jul 22, 2020 at 3:42 PM Gael Jourdan-Weil <gael.jourdan-w...@kelkoogroup.com> wrote:

> Hello,
>
> I'm facing a situation where a transaction log file keeps growing and is
> never deleted.
>
> The setup is as follows:
> - Solr 8.4.1
> - SolrCloud with 2 nodes
> - 1 collection, 1 shard
>
> On one of the nodes I can see the tlog files having the expected behavior,
> that is new tlog files being created and old ones being deleted at a
> frequency that matches the autocommit settings.
> For instance, there are currently two files tlog.0000000000000003226 and
> tlog.0000000000000003227, each of them around 1G in size.
>
> But on the other node, I see two files tlog.0000000000000000298 and
> tlog.0000000000000000299, the latter now being 20G and having been created
> 10 hours ago.
>
> It already happened a few times, restarting the server seems to make
> things go right but it's obviously not a durable solution.
>
> Do you have any idea what could cause this behavior?
>
> solrconfig.xml:
> <updateHandler class="solr.DirectUpdateHandler2">
>   <updateLog>
>     <str name="dir">${solr.ulog.dir:}</str>
>     <int name="numRecordsToKeep">1000</int>
>     <int name="maxNumLogsToKeep">100</int>
>   </updateLog>
>   <autoCommit>
>     <maxTime>900000</maxTime>
>     <openSearcher>false</openSearcher>
>   </autoCommit>
>   <autoSoftCommit>
>     <maxTime>180000</maxTime>
>   </autoSoftCommit>
> </updateHandler>
>
> Kind regards,
> Gaël
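
For anyone who wants to set up a similar early warning, here is a minimal Python sketch of a check for oversized transaction logs. It is not our actual monitoring code. It assumes the default layout in which each core keeps its logs in a directory named `tlog` (for example `<core>/data/tlog/tlog.NNN`), so it will miss logs if `solr.ulog.dir` points somewhere else. The function name `find_large_tlogs` and the threshold are made up for illustration; pick a threshold that fits your own commit interval and ingest rate.

```python
from pathlib import Path


def find_large_tlogs(solr_home, threshold_bytes):
    """Return (path, size_in_bytes) pairs for transaction log files whose
    size is at least threshold_bytes, largest first.

    Assumption: tlog files are named 'tlog.<number>' and live in a
    directory called 'tlog' somewhere below solr_home.
    """
    large = []
    for path in Path(solr_home).rglob("tlog.*"):
        # Only consider regular files that sit directly in a 'tlog' directory.
        if not path.is_file() or path.parent.name != "tlog":
            continue
        size = path.stat().st_size
        if size >= threshold_bytes:
            large.append((str(path), size))
    # Largest files first, so the worst offender is at the top of the report.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling, say, `find_large_tlogs("/var/solr/data", 5 * 1024**3)` (both values hypothetical) from a cron job and alerting on a non-empty result would flag a replica whose single log keeps growing, since a healthy core cycles through smaller logs.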