Hi,

Before I create a Solr Jira issue I would like to ask for some feedback here first. I started a thread on the users mailing list a few days ago about this topic, but it received no relevant feedback (https://lists.apache.org/thread/yv5d869tzckdgp6py83lqrgtpqz0xqvv), and I have done some more investigation since then. That thread contains some log file snippets if you are interested in a bit more context.
In a Solr 8.11 master/replica setup where documents in multiple Solr cores are updated at the same time on the master, there are very large delays on the replica between replicating data for the first core and for subsequent cores. This seems to stem from mutex locking that blocks threads operating on DIFFERENT cores, rather than only blocking threads that operate on the SAME core.

A "replication" can be divided into a couple of milestones for the IndexFetcher thread(s) on the Solr replica:

1. Notices the leader has a new version and logs "Starting replication process"
2. Creates a new IndexWriter (createMainIndexWriter)
3. Downloads the index updates from the master
4. Creates a new IndexWriter again!? (createMainIndexWriter)
5. Calls openNewSearcher
6. Gets new commit point and is done

The above activities should (as far as I know) run independently for each core, so that each core can replicate concurrently with the other cores, but they clearly do not. Currently, updates to multiple cores on the master cause the IndexFetcher threads on the replica to block each other, and the problem seems to get worse quickly the more cores you update at the same time. This can make it extremely difficult to meet a sensible indexing latency service level agreement on the replicas, since the blocking can add several minutes of delay.

The biggest culprits code-wise seem to be:

* createMainIndexWriter is prevented from returning while another replication thread is active. Any mutex locks used in such code should be core specific, not global.
* openNewSearcher can take a fairly long time to return (perhaps due to cache warming?), but it also seems to block other replication threads from progressing. Opening a new searcher should not hold any locks that are related to replication, especially not any locks related to other Solr cores.
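To make the "core specific, not global" locking idea concrete, here is a minimal Java sketch of what I have in mind. This is not Solr code and I am not claiming it mirrors how Solr is structured internally: the class PerCoreLockSketch and the methods lockFor, withCoreLock and canAcquireFromOtherThread are hypothetical names made up for illustration. It only shows that with one lock per core name, a thread working on one core does not block a thread working on another core, while two threads working on the same core still exclude each other.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Solr.
class PerCoreLockSketch {
    // One lock per core name instead of a single lock shared by all cores.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS =
            new ConcurrentHashMap<>();

    // Returns the lock dedicated to the given core, creating it on first use.
    static ReentrantLock lockFor(String coreName) {
        return LOCKS.computeIfAbsent(coreName, name -> new ReentrantLock());
    }

    // Runs a step (a stand-in for something like "create writer" or
    // "open searcher") while holding only that core's lock.
    static void withCoreLock(String coreName, Runnable step) {
        ReentrantLock lock = lockFor(coreName);
        lock.lock();
        try {
            step.run();
        } finally {
            lock.unlock();
        }
    }

    // Checks from a separate thread whether the lock of the given core
    // can be taken right now, without waiting for it.
    static boolean canAcquireFromOtherThread(String coreName) {
        boolean[] acquired = new boolean[1];
        Thread probe = new Thread(() -> {
            ReentrantLock lock = lockFor(coreName);
            if (lock.tryLock()) {
                acquired[0] = true;
                lock.unlock();
            }
        });
        probe.start();
        try {
            probe.join(); // join makes the write to acquired[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquired[0];
    }

    public static void main(String[] args) {
        ReentrantLock core1 = lockFor("core1");
        core1.lock(); // pretend a replication step for core1 is in progress
        try {
            // A thread working on another core is not blocked.
            System.out.println("core2 available: " + canAcquireFromOtherThread("core2")); // true
            // A thread working on the same core is blocked.
            System.out.println("core1 available: " + canAcquireFromOtherThread("core1")); // false
        } finally {
            core1.unlock();
        }
    }
}
```

With a single global lock in place of the map, the first check would also report false, which is the kind of cross-core blocking I believe I am seeing on the replica.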
As an example, when I update documents in 3 cores, the first IndexFetcher thread took 12s from logging "Starting replication process" until it was done replicating and resumed its normal behavior of checking Leader vs Follower versions. The second took 1m38s, and the third took 2m49s.

Does this ring any bells in terms of existing known issues, or known pitfalls that explain why Solr's replication threads must be synchronized in the way they are?

Depending on how feasible I think it is, I may try to provide a merge request, but it would be good to at least have confirmation that the intended behavior is for each core to replicate independently. Any pointers regarding specific mutexes that might need to be replaced by more core specific ones (and suggestions for a good existing mutex to use instead) would be helpful.

Kind regards,
Marcus