Re: Integration test failures on Jenkins
The failures are ongoing, so I suspect no one has (yet) reached out to Infra about this.

I pinged them this morning in #asfinfra, so hopefully we can get this resolved shortly.

-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org
Solr replication threads blocking each other causing large delays
Hi,

Before I proceed to create a Solr Jira issue I'll ask for some feedback here first. I started a thread on the users mailing list a few days ago about this topic without any relevant feedback (https://lists.apache.org/thread/yv5d869tzckdgp6py83lqrgtpqz0xqvv) and I have done some more investigation since then. That thread contains some log file snippets if you are interested in a bit more context.

In a Solr 8.11 master/replica setup where documents in multiple Solr cores are updated at the same time on the master, there are very large delays on the replica between replicating data for the first core and for subsequent cores. This seems to stem from mutex locking that blocks threads that operate on DIFFERENT cores, rather than only blocking threads that operate on the SAME core.

A "replication" can be divided into a couple of milestones for the IndexFetcher thread(s) on the Solr replica:

1. Notices the leader has a new version and logs "Starting replication process"
2. Creates a new IndexWriter (createMainIndexWriter)
3. Downloads the index updates from the master
4. Creates a new IndexWriter again!? (createMainIndexWriter)
5. Calls openNewSearcher
6. Gets new commit point and is done

The above activities should (as far as I know) run independently for each core, so that each core can perform replication concurrently with other cores, but they clearly do not. Currently, updates on multiple cores on the master cause IndexFetcher threads on the replica to block each other, and the problem seems to quickly get worse the more cores you update at the same time. This can make it extremely difficult to meet a sensible indexing latency service level agreement on the replicas, since the blocking can add several minutes of delay.

The biggest culprits code-wise seem to be:

* createMainIndexWriter is prevented from returning while another replication thread is active. Any mutex locks used in such code should be core specific, not global.
* openNewSearcher can take a fairly long time to return (perhaps due to cache warming?) but it also seems to block other replication threads from progressing. Opening a new searcher should not hold any locks that are related to replication, especially not any locks related to other Solr cores.

As an example, when I updated documents in 3 cores, the first IndexFetcher thread spent 12s from logging "Starting replication process" until it was done replicating and resumed its normal behavior of checking Leader vs Follower versions, the second spent 1m38s, and the third spent 2m49s.

Does this ring any bells in terms of existing known issues or known pitfalls regarding why Solr's replication threads must be synchronized in the way they are?

Depending on how feasible I think it is, I may try to provide a merge request here, but it would be good to at least have confirmation that the intended behavior for replication is that each core should replicate independently. Any pointers regarding specific mutexes that might need to be replaced by more core specific ones (and suggestions on what an existing good mutex is) would be helpful.

Kind regards,
Marcus
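
For readers following the thread, the distinction between a global mutex and a core specific one can be sketched in a few lines of Java. This is only an illustration of the general idea raised in the message above: the class PerCoreLocks and its methods are hypothetical names invented for this sketch, they are not part of Solr, and nothing here is claimed to reflect how Solr's replication code is actually structured.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only: none of these names exist in Solr.
final class PerCoreLocks {
    // One lock per core name, created lazily the first time a core is seen.
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Returns the lock that belongs to exactly one core.
    ReentrantLock lockFor(String coreName) {
        return locks.computeIfAbsent(coreName, name -> new ReentrantLock());
    }

    // Runs one step while holding only that core's lock, so work on
    // other cores is never blocked by it.
    void runExclusive(String coreName, Runnable step) {
        ReentrantLock lock = lockFor(coreName);
        lock.lock();
        try {
            step.run();
        } finally {
            lock.unlock();
        }
    }
}
```

With a single shared lock, a slow step for one core would delay every other core; with one lock per core name, two threads only contend when they work on the same core. Whether such a change is safe in Solr depends on what the existing locks protect, which is the open question in the message above.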
Re: Integration test failures on Jenkins
Sorry, got busy with something and this fell off my radar. Will work with Jason.

- Houston

On Tue, Jun 25, 2024 at 7:43 AM Jason Gerlowski wrote:
> The failures are ongoing, so I suspect no one has (yet) reached out to
> Infra about this.
>
> I pinged them this morning in #asfinfra, so hopefully we can get this
> resolved shortly.
lucene-solr-1 Jenkins Agent Management
Hey all,

Sending this email to discuss the ASF Jenkins' 'lucene-solr-1' worker node, which both Lucene and Solr use to run builds.

While investigating a recent issue with some Solr builds, I asked INFRA to restart 'lucene-solr-1'. They were happy to oblige (and I'm now unblocked), but in the process they mentioned that 'lucene-solr-1' was a "project run VM" and that someone in the project should have the requisite access and permissions. [1] [2]

This was all a surprise to me, so I wanted to follow up here with a few questions.

1. Is it true that our projects run or manage this VM in some way?
2. If so, does anyone remember the context around how or why this came about? (As opposed to using one of the "standard" Jenkins build machines that INFRA provides...)
3. Who all can access the box? How are credentials managed? (My suspicion is that no one currently active on the Solr lists is able to access the box, which seems worth remedying...)

Best,
Jason

[1] https://lists.apache.org/thread/g0y0ohczdctv0v5fn8sqv4t0j4y6hp68
[2] https://the-asf.slack.com/archives/CBX4TSBQ8/p1719318685682309