Hmmm, now we’re getting somewhere. Here’s the code block in DistributedUpdateProcessor

    if (ulog == null || ulog.getState() == UpdateLog.State.ACTIVE || (cmd.getFlags() & UpdateCommand.REPLAY) != 0) {
      super.processCommit(cmd);
    } else {
      if (log.isInfoEnabled()) {
        log.info("Ignoring commit while not ACTIVE - state: {} replay: {}",
            ulog.getState(), ((cmd.getFlags() & UpdateCommand.REPLAY) != 0));
      }
    }

Why you’re buffering is the mystery. Note that for my previous e-mail you’d have to wait 15 minutes after you started indexing to see a new tlog, and also wait until at least 1,000 new documents after _that_ before the large tlog went away. I don’t think that’s your issue though.

On a _very_ quick look at the code (and this is not code I’m intimately familiar with), the only time the state should be BUFFERING is if the node is in recovery. Once recovery is complete, the tlog state should change. So I think that’s the place to focus.

Did the node recover completely and go active? Just checking the admin UI and seeing it be green is sometimes not enough. Check the state.json znode and see if the state is also “active” there.

Next, try sending a request directly to that replica. Frankly I’m not sure what to expect, but if you get something weird that’d be a “smoking gun” that no matter what the admin UI says, the replica isn’t really active. Something like “http://blah blah blah/solr/collection1_shard1_replica_n1?q=some_query&distrib=false”. The “distrib=false” is important, otherwise the request will be forwarded to a truly active node. I’d tail the log on that replica at the same time to gather clues.

Your Solr log should also indicate that the replica went into recovery and, eventually, completed.

The scenario seems to be that
- the replica goes into recovery
- the replica either never catches up _or_ it would eventually catch up but is processing so much data that it just seems like it’s stuck.

If the replica never catches up, especially if you slow down/stop indexing, that’s certainly a bug.

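For the direct-replica check suggested above, here is a rough sketch in Java (11 or later) of what I mean. It’s only an illustration: the class name ReplicaProbe and the method directQueryUrl are made up for this e-mail, the base URL and core name are placeholders you’d replace with your own, and I’m assuming the standard /select handler is configured on that core.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr: sends one query straight to a single
// replica's core so the response reflects that replica only.
public class ReplicaProbe {

    // Builds a query URL for one specific core. Assumes the /select handler
    // exists on that core. distrib=false keeps the request on this replica.
    public static String directQueryUrl(String solrBaseUrl, String coreName, String query) {
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return solrBaseUrl + "/" + coreName + "/select?q=" + encodedQuery + "&distrib=false";
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ReplicaProbe <solrBaseUrl> <coreName> [query]");
            return;
        }
        String query = args.length > 2 ? args[2] : "*:*";
        String url = directQueryUrl(args[0], args[1], query);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());

        // Print the raw status and body; compare with what the replica's log shows.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

You’d run it with the node’s base URL (ending in /solr) and the core name as arguments, while tailing the log on that replica. A plain curl against the same URL would do just as well.
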
In days long ago the tlog replay could be very inefficient, but that hasn’t been the case since well before 8.4. Regressions are always possible of course.

Since it’s expected that the tlog will grow until recovery is complete, it feels like this is somewhat on the right track.

You should see some messages at WARN level like “Starting log replay…” and “Log replay finished…”, and INFO level messages every 1,000 docs replayed like “log replay status…”.

I’d be grepping my log for anything that mentions “replay” (case-insensitive!).

If you’re interested in code spelunking, LogReplayer’s run and doReplay in UpdateLog.java are where you can find the messages I’d expect to see in the log.

If you want to enable DEBUG level for UpdateLog.java you’ll see info about the individual entries from the tlog that are replayed, but I’d only go there if the progress every 1,000 docs doesn’t show anything useful.

Good Luck!
Erick

> On Jul 23, 2020, at 10:03 AM, Gael Jourdan-Weil <gael.jourdan-w...@kelkoogroup.com> wrote:
>
> Ignoring commit while not ACTIVE - state: BUFFERING