Hi,
Before I create a Solr Jira issue, I would like to ask for some feedback here 
first. I started a thread on the users mailing list a few days ago about this 
topic but received no relevant feedback 
(https://lists.apache.org/thread/yv5d869tzckdgp6py83lqrgtpqz0xqvv), and I have 
done some more investigation since then. That thread contains some log file 
snippets if you are interested in a bit more context.

In a Solr 8.11 master/replica setup where documents in multiple Solr cores are 
updated at the same time on the master, there are very large delays on the 
replica between replicating data for the first core and for subsequent cores. 
This seems to stem from some mutex locking that blocks threads operating on 
DIFFERENT cores, rather than only blocking threads operating on the SAME 
core. A "replication" can be divided into a few milestones for the 
IndexFetcher thread(s) on the Solr replica:


  1. Notices the leader has a new version and logs "Starting replication process"
  2. Creates a new IndexWriter (createMainIndexWriter)
  3. Downloads the index updates from the master
  4. Creates a new IndexWriter again!? (createMainIndexWriter)
  5. Calls openNewSearcher
  6. Gets new commit point and is done

The above activities should (as far as I know) run independently for each core 
so that each core can perform replication concurrently with other cores, but 
they clearly do not. Currently, updates on multiple cores on the master 
cause IndexFetcher threads on the replica to block each other, and the problem 
seems to get worse quickly the more cores you update at the same time. This can 
make it extremely difficult to meet a sensible indexing latency service level 
agreement on the replicas, since the blocking can add several minutes of delay. 
The biggest culprits code-wise seem to be:


  * createMainIndexWriter is prevented from returning while another replication 
    thread is active. Any mutex locks used in such code should be core specific, 
    not global.
  * openNewSearcher can take a fairly long time to return (perhaps due to cache 
    warming?) but it also seems to block other replication threads from 
    progressing. Opening a new searcher should not hold any locks that are 
    related to replication, especially not any locks related to other Solr cores.
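
To illustrate what I mean by "core specific, not global", here is a minimal 
sketch of per-core locking. This is NOT Solr code and I am not claiming Solr 
has (or lacks) anything like it; PerCoreLocks, lockFor and withCoreLock are 
hypothetical names I made up for the example. The idea is only that a thread 
working on one core takes a lock keyed by that core's name, so threads working 
on other cores are never blocked by it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper (not part of Solr): hands out one lock per core name. */
public class PerCoreLocks {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Returns the lock dedicated to the given core, creating it on first use. */
    public ReentrantLock lockFor(String coreName) {
        return locks.computeIfAbsent(coreName, name -> new ReentrantLock());
    }

    /** Runs the action while holding only the named core's lock. */
    public void withCoreLock(String coreName, Runnable action) {
        ReentrantLock lock = lockFor(coreName);
        lock.lock();
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PerCoreLocks locks = new PerCoreLocks();

        // Simulate a slow replication step on "core1" by holding its lock here.
        locks.lockFor("core1").lock();
        try {
            // A thread working on "core2" uses a different lock, so it is not blocked.
            Thread other = new Thread(() ->
                locks.withCoreLock("core2",
                    () -> System.out.println("core2 proceeded while core1 was locked")));
            other.start();
            other.join();
        } finally {
            locks.lockFor("core1").unlock();
        }
    }
}
```

With a single global lock in place of lockFor(coreName), the "core2" thread in 
main would have to wait until "core1" released the lock, which is the kind of 
serialization I believe I am seeing.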

As an example, when I updated documents in 3 cores, the first IndexFetcher 
thread spent 12s between logging "Starting replication process" and finishing 
replication (at which point it resumed its normal behavior of checking Leader 
vs Follower versions), the second spent 1m38s, and the third spent 2m49s.

Does this ring any bells in terms of existing known issues or known pitfalls 
regarding why Solr's replication threads must be synchronized in the way they 
are?

Depending on how feasible I think it is, I may try to provide a merge request 
here, but it would be good to at least have confirmation that the intended 
behavior for replication is that each core should replicate independently. Any 
pointers regarding specific mutexes that might need to be replaced by more 
core-specific ones (and suggestions for a good existing mutex to use instead) 
would be helpful.

Kind regards,

Marcus
