Hmmm, now we’re getting somewhere. Here’s the code block in 
DistributedUpdateProcessor:

if (ulog == null || ulog.getState() == UpdateLog.State.ACTIVE
    || (cmd.getFlags() & UpdateCommand.REPLAY) != 0) {
  super.processCommit(cmd);
} else {
  if (log.isInfoEnabled()) {
    log.info("Ignoring commit while not ACTIVE - state: {} replay: {}",
        ulog.getState(), ((cmd.getFlags() & UpdateCommand.REPLAY) != 0));
  }
}

Why you’re buffering is the mystery.

Note that, per my previous e-mail, you’d have to wait 15 minutes after you 
started indexing to see a new tlog, and then wait for at least 1,000 new 
documents after _that_ before the large tlog went away. I don't think that’s 
your issue though.

On a _very_ quick look at the code (and this is not code I’m intimately 
familiar with), the only time the state should be BUFFERING is if the node is 
in recovery. Once recovery is complete, the tlog state should change.

So I think that’s the place to focus. Did the node recover completely and go 
active? Just checking the admin UI and seeing it be green is sometimes not 
enough. Check the state.json znode and see if the state is also “active” there.

Next, try sending a request directly to that replica. Frankly I’m not sure what 
to expect, but if you get something weird that’d be a “smoking gun” that no 
matter what the admin UI says, the replica isn’t really active. Something like 
“http://blah blah 
blah/solr/collection1_shard1_replica_n1/select?q=some_query&distrib=false”. The 
“distrib=false” is important; otherwise the request will be forwarded to a 
truly active node.
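
If you’d rather script that check than paste URLs into a browser, here is a minimal sketch in Java (11 or later). The host, port and core name are placeholders I made up for illustration, and I’m assuming the standard /select handler; substitute whatever your replica actually uses.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DirectReplicaQuery {

  // Builds a query URL aimed at one specific core, with distrib=false so the
  // request is answered by that core instead of being forwarded elsewhere.
  public static URI buildUri(String solrBaseUrl, String coreName, String query) {
    String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
    return URI.create(solrBaseUrl + "/" + coreName + "/select?q=" + q + "&distrib=false");
  }

  public static void main(String[] args) throws Exception {
    // Placeholder values (assumptions): use your own host, port and core name.
    URI uri = buildUri("http://localhost:8983/solr", "collection1_shard1_replica_n1", "*:*");

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

    // Print the status and body so anything unexpected is easy to spot.
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```

An error status or an odd response body here, while the admin UI shows green, would be the kind of weirdness worth chasing.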

I’d tail the log on that replica at the same time to gather clues.

Your Solr log should also indicate that the replica went into recovery and, 
eventually, completed. The scenario seems to be:
- the replica goes into recovery
- the replica either never catches up _or_ it would eventually catch up but is 
processing so much data that it just seems like it’s stuck.

If the replica never catches up, especially if you slow down/stop indexing, 
that’s certainly a bug. In days long ago the tlog replay could be very 
inefficient, but that hasn’t been the case since well before 8.4. Regressions 
are always possible of course.

Since it’s expected that the tlog will grow until recovery is complete, it 
feels like this is somewhat on the right track.

You should see some messages at WARN level like 
“Starting log replay…” and “Log replay finished…”

and INFO level messages every 1,000 docs replayed like
“log replay status…”

I’d be grepping my log for anything that mentions “replay” (case-insensitive!). 
If you’re interested in code spelunking, LogReplayer.run and doReplay in 
UpdateLog.java are where you can find the messages I’d expect to see in the log.
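
A plain case-insensitive grep does the job, but if you want the same filter in Java, a minimal sketch could look like the one below. The class name is my own invention and the log file path comes in as the first argument; nothing here depends on Solr itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReplayGrep {

  // Matches "replay" anywhere in a line, ignoring case.
  private static final Pattern REPLAY = Pattern.compile("replay", Pattern.CASE_INSENSITIVE);

  // Returns only the lines that mention "replay", in their original order.
  public static List<String> replayLines(List<String> lines) {
    return lines.stream()
        .filter(line -> REPLAY.matcher(line).find())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws IOException {
    // args[0] is the path to the Solr log you want to scan.
    Path logFile = Path.of(args[0]);
    try (Stream<String> lines = Files.lines(logFile)) {
      lines.filter(line -> REPLAY.matcher(line).find()).forEach(System.out::println);
    }
  }
}
```

Run it against the log of the suspect replica and check whether a “Starting log replay” line is ever followed by a “Log replay finished” one.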

If you want to enable DEBUG level for UpdateLog.java you’ll see info about the 
individual entries from the tlog that are replayed, but I’d only go there if 
the progress every 1,000 docs doesn’t show anything useful.

Good Luck!
Erick

> On Jul 23, 2020, at 10:03 AM, Gael Jourdan-Weil 
> <gael.jourdan-w...@kelkoogroup.com> wrote:
> 
> Ignoring commit while not ACTIVE - state: BUFFERING
