Hello folks,

We see similar behavior from time to time.  The main difference seems to be
that you see it while using NRT replication and we see it while using TLOG
replication.

* Solr 7.5.0.
* 1 collection with 12 shards, each with 2 TLOG and 2 PULL replicas.
* 12 machines, each machine hosting one node/JVM.  Each node contains 4
replicas (different shards).
* No explicit commits in the update requests.  AutoCommit=15s,
AutoSoftCommit=1s.

The symptoms we observe are as follows:
* It's on a TLOG replica that is not currently the leader.
* For that replica, there is a single transaction log that keeps on growing.
* For that replica, new segments are not being fetched from that shard's
TLOG leader.
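
To narrow a check down to the replicas matching the first symptom, something like the sketch below could be used.  It only parses a response that has already been fetched from the Collections API CLUSTERSTATUS action.  The function name is made up for illustration, and the field names ("type", "leader", "core", "node_name") reflect my understanding of that response in Solr 7.x, so treat them as assumptions and verify against your own cluster's output.

```python
def non_leader_tlog_replicas(cluster_status, collection):
    """Return (shard, core, node_name) for each TLOG replica that is not a leader.

    cluster_status is the parsed JSON of a CLUSTERSTATUS response (assumed shape).
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    result = []
    for shard_name, shard in sorted(shards.items()):
        for replica in shard.get("replicas", {}).values():
            if replica.get("type") != "TLOG":
                continue  # skip PULL (and NRT) replicas
            # The leader flag is assumed to appear as the string "true" on leaders only.
            if str(replica.get("leader", "false")).lower() == "true":
                continue
            result.append((shard_name, replica.get("core"), replica.get("node_name")))
    return result
```

The returned cores are only candidates.  Whether one is actually stuck still has to be judged from its transaction log and segment fetches.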

In this configuration, one node contains four TLOG cores.  We have observed
the problem occurring on a single one of the cores as well as on multiple
cores in one node.

Anecdotally, it seems to occur more frequently on those collections that
are large (number of documents, size on disk) and that have a higher ingest
rate.  These are vague terms and I don't know that I'm allowed to share
specifics, but I can say that we run a number of different clouds with a
similar setup and that this problem occurs more frequently for the more
loaded clouds.

Initially we couldn't tell that this was occurring: queries were directed
to PULL replicas, so the applications saw nothing, and the affected TLOG
cores reported as active and healthy, so nothing looked obviously wrong.
Our early-warning system now consists of checking for large
transaction logs.  When we see this, we restart the problem node.  Upon
restart it recovers from its leader (fetching whatever segments it had
missed - hours, days, ...).  Eventually the large transaction log
disappears and that core starts to cycle through a series of smaller
transaction logs (the normal behavior).
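
For what it's worth, the large-transaction-log check can be as simple as the sketch below.  This is not our actual monitoring code, just an illustration: the function name, the root directory you pass in, and the size threshold are all assumptions to adapt to your own layout.  It relies only on the tlog.<number> file naming seen in the quoted message.

```python
import os


def find_large_tlogs(root, threshold_bytes):
    """Return (path, size) for every tlog.* file under root above threshold_bytes.

    Results are sorted largest first.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.startswith("tlog."):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # the log may have been rotated away mid-scan
            if size > threshold_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A healthy core cycles through small logs, so a threshold well above the size of a normal log for that core should keep false alarms down.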

We noticed that the ctime of the large transaction log seemed to be
slightly later than that node's restart time.  After that discovery, we saw
the following pattern every time we observed the problem:
* Everything is in a good state at the beginning.  A is a node containing a
TLOG replica that is leader for its shard and B is a node containing a TLOG
replica that is not a leader.  Ingest is ongoing.
* A is stopped for a short period of time (< 30s) and is then started up
again.  If it makes a difference, our way of stopping it relies on
systemd's default behavior - send SIGTERM, wait for 5s, send SIGKILL.
* The TLOG replica in B emerges from this as the leader for its shard.
Everything about B appears to suggest B is operating correctly.  The TLOG
replica in A has the ever-growing transaction log and never fetches new
segments.
* The malfunctioning TLOG replica in A can be "fixed" by restarting A.
* As noted earlier, this can affect cores (in a single node) individually.
It can be a problem for one and not the others (or for all of the cores in
a node).

It was suggested to us that this might be
https://issues.apache.org/jira/browse/SOLR-13486.

On Wed, Jul 22, 2020 at 3:42 PM Gael Jourdan-Weil <
gael.jourdan-w...@kelkoogroup.com> wrote:

> Hello,
>
> I'm facing a situation where a transaction log file keeps growing and is
> never deleted.
>
> The setup is as follows:
> - Solr 8.4.1
> - SolrCloud with 2 nodes
> - 1 collection, 1 shard
>
> On one of the nodes I can see the tlog files having the expected behavior,
> that is new tlog files being created and old ones being deleted at a
> frequency that matches the autocommit settings.
> For instance, there are currently two files tlog.0000000000000003226 and
> tlog.0000000000000003227, each of them around 1G in size.
>
> But on the other node, I see two files tlog.0000000000000000298 and
> tlog.0000000000000000299, the latter now being 20G and having been created
> 10 hours ago.
>
> It already happened a few times, restarting the server seems to make
> things go right but it's obviously not a durable solution.
>
> Do you have any idea what could cause this behavior?
>
> solrconfig.xml:
>   <updateHandler class="solr.DirectUpdateHandler2">
>     <updateLog>
>       <str name="dir">${solr.ulog.dir:}</str>
>       <int name="numRecordsToKeep">1000</int>
>       <int name="maxNumLogsToKeep">100</int>
>     </updateLog>
>     <autoCommit>
>       <maxTime>900000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>         <maxTime>180000</maxTime>
>     </autoSoftCommit>
>   </updateHandler>
>
> Kind regards,
> Gaël
>
>
